Received 1 Nov. 2002

Abstract

Bladder cancer is a common malignant disease characterised by frequent recurrences 1,2 . 
Important factors determining the disease course of the individual patient are the stage of 
disease at diagnosis and the presence of surrounding carcinoma in situ*. Despite significant 
efforts, no accepted immunohistological or molecular markers define clinically relevant 
subsets of bladder cancer. Here we report the identification of clinically relevant subclasses of 
bladder carcinoma using expression microarray analysis of 40 well-characterised bladder 
tumours. Hierarchical cluster analysis identified the three major stages (Ta, T1 and T2-4), and
the Ta tumours were further separated into well-defined subgroups. We built a 32-gene
molecular classifier using a cross-validation approach, which classified benign and muscle
invasive tumours with close correlation to pathological staging. The classifier provided new
predictive information on disease progression in Ta tumours (P<0.005). Other classifiers
contained up to 320 genes and performed similarly well. To delineate non-recurring Ta
tumours from frequently recurring Ta tumours we analysed expression patterns in 31
tumours by applying a supervised learning classification methodology, which classified 75%
of the samples correctly (P<0.006). Furthermore, gene expression profiles characterising each
stage and subtype demonstrated their biological properties and form new potential targets for
therapy.

Introduction 

Bladder cancer in the form of transitional cell carcinomas is a common malignant disease 
characterized by frequent recurrences. An important factor determining the disease course of the
patient is the stage of disease at diagnosis. Patients presenting with relatively harmless stage Ta
superficial papillomas will have recurrences in 50% of cases, but less than 10% will later develop
an invasive tumor. On the other hand, tumors that show a superficial invasion into the submucosa,
stage T1, have a recurrence rate of 70%, and 30% of those patients will later develop a muscle
invasive tumor. Finally, about 25% of patients present with an invasive stage T2-4 tumor at
diagnosis 1 . Another epithelial abnormality influencing the disease course is the possible presence
of dysplasia or carcinoma in situ in the mucosa surrounding the tumor. Patients having such field
disease have much more frequent recurrences and a relatively poor prognosis, as 37% die within 10
years 2 .

DNA fingerprinting as well as comparative genomic hybridization (CGH) have demonstrated 
that metachronous bladder tumors are of the same clonal origin 3,4 . However, it is still not
understood how a stage T1 tumor in the left side of the bladder mucosa can share clonal origin with
a stage T2 tumor occurring in the right side after a purported tumor-free interval of more than one
year. Theories on implantation or seeding of tumor cells exist but have never been proved 4 . CGH
technology has also shown that the superficial stage Ta and mucosa invasive T1 tumors, although
they may look similar microscopically, differ markedly in chromosomal integrity 5 . Stage T1
tumors show many more losses and gains of chromosomal material than do stage Ta tumors. This
has led to the suggestion that stage Ta and stage T1 tumors represent clinically different diseases 6 .

Recent advances in microarray technology have made it possible to characterize cancers based 
on the expression of thousands of genes. Parallel gene expression monitoring is a powerful tool for 
the analysis of the relation between tumors, for discovering new tumor subgroups (class discovery), 
for assigning tumors to pre-defined classes (class prediction), and for identifying co-regulated or 
tumor stage specific genes 7-11 . In a recent study of bladder cancer, we demonstrated functional
groups of genes whose co-regulation formed the basis for separating bladder tumors into superficial
and muscle invasive tumors 12 .

Here we used microarrays with approximately 5000 full-length genes to analyze gene
expression in 40 bladder tumors selected from a very large clinical specimen bank holding more
than 35,000 samples from bladder cancer patients, prospectively followed for up to six years. The
selection was based on the disease course, stage, grade, concomitant carcinoma in situ, and
recurrence frequency, in such a way that the selected tumors represent the spectrum from harmless
stage Ta grade 2 superficial papillomas to muscle invasive stage T2 grade 4 tumors. Our data
demonstrate distinctly different gene expression in Ta tumors that separates these into three groups:
relatively harmless Ta grade 2 tumors, frequently recurring stage Ta grade 3 tumors, and stage Ta
grade 3 tumors with surrounding carcinoma in situ that cluster together with the invasive tumors.
The arrays identified even minor histological alterations such as the presence of areas of squamous
metaplasia in invasive tumors, or the presence of carcinoma in situ. Co-regulated groups of genes,
such as genes related to proliferation, immune response and transcription, being up- or down-regulated
at certain stages and grades, describe the cell biological events that characterize each of
the clinically well-known bladder tumor stages. Finally, from a set of 30 to 320 classifying genes
we classified the tumor samples with close correlation to the pathological staging, and obtained
additional information on progression of disease and recurrence of tumors, as well as presence of
carcinoma in situ.

Results 

From our bladder cancer specimen bank we selected tumors of different histological stages and 
grades from six groups of patients (Table 1): (a) 5 patients with pTa grade II tumors (no recurrence);
(b) 5 patients with pTa grade III tumors (no prior pT1 tumor or CIS); (c) 5 patients with pTa grade
III tumors (CIS but no prior pT1 tumor); (d) 4 patients with pTa grade III tumors (a prior pT1 tumor
and CIS); (e) 11 patients with pT1 grade III tumors (no prior pT2+ tumor); and (f) 10 patients with
primary invasive pT2+ grade III/IV tumors. See Supplementary Information; Table 1 for complete
disease course. In total, 40 preparations of RNA from tumor and 4 from normal urothelial tissue
were labeled and hybridized to Affymetrix oligonucleotide microarrays with approximately 5000
full-length genes. Scanning identified the expression level of the genes utilizing antibody
amplification of weakly expressed genes. Genes that did not vary throughout the data-set, e.g.
housekeeping genes, were eliminated, and only the 1767 genes (26 %) that showed an expression
level change in tumor tissue compared to normal urothelium were subjected to cluster analysis.

Sample clustering

A two-way hierarchical clustering of the tumor samples based on the 1767 gene-set remarkably
separated all 40 tumors according to stages and grades with only a few exceptions (Fig. 1a). Two main
branches holding the superficial pTa tumors and the invasive pT1 and pT2+ tumors, respectively,
were identified. In the superficial branch two sub-clusters of tumors could be identified, one holding
8 tumors that had frequent recurrences and one holding 3 out of the five pTa grade 2 tumors with no
recurrence. In the invasive branch it was remarkable to find four pTa grade 3 tumors clustering
tightly with the muscle invasive T2 tumors. These pTa tumors showed concomitant carcinoma in
situ in the surrounding mucosa. This indicates that this sub-fraction of pTa tumors has some of the
more aggressive features found in muscle invasive tumors. The pT1 cluster could be separated into
three sub-clusters: one holding four tumors including a pTa tumor, of which two had CIS, and two
others with no clear clinical difference. The one stage pT1 grade 3 tumor that clustered with the
stage pT2+ muscle invasive tumors was the only T1 tumor that showed a solid growth pattern; the
others were papillomas. Nine out of ten pT2+ tumors were found in one single cluster. As another
technique to demonstrate the remarkable separation of the tumors we used multidimensional scaling
analysis (Fig. 1c).
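The sample clustering behind Fig. 1a, as specified in Methods (median centering, normalization to magnitude 1, average linkage, correlation-based similarity), can be illustrated with a short sketch. It is not the software used in the study, and it rests on several assumptions: it uses the standard Pearson correlation rather than the modified Pearson correlation mentioned in Methods, it applies one normalization pass to genes and then one to arrays, and all function names are hypothetical.

```python
from math import sqrt
from statistics import median


def median_center_and_scale(vector):
    """Subtract the median, then scale the vector to Euclidean length 1."""
    centre = median(vector)
    centred = [v - centre for v in vector]
    length = sqrt(sum(v * v for v in centred))
    return [v / length for v in centred] if length > 0 else centred


def preprocess(matrix, sample_names):
    """matrix holds one row per gene and one column per sample.

    Assumption: genes are normalized first, then arrays, one pass each.
    Returns {sample name: normalized expression profile}.
    """
    rows = [median_center_and_scale(row) for row in matrix]
    columns = [median_center_and_scale(list(col)) for col in zip(*rows)]
    return dict(zip(sample_names, columns))


def pearson(a, b):
    """Standard Pearson correlation (a stand-in for the modified version)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def average_linkage(profiles):
    """Agglomerative clustering with distance = 1 - correlation.

    Returns the merges in order as (members_a, members_b, distance).
    """
    names = list(profiles)
    distance = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for i, a in enumerate(names) for b in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(distance[frozenset(p)] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Reading the returned merges from last to first gives the main branches of the dendrogram first, which is how the superficial and invasive branches described above would appear.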

In an attempt to reduce the number of genes needed for class prediction we identified those genes 
that were scored by the Cancer Genome Anatomy Project as belonging to cancer-related groups 
such as tumor suppressors, oncogenes, genes involved in DNA-damage, angiogenesis, apoptosis, 
cell cycle, cell behavior, cell signaling, development, gene regulation, and transcription. These 
genes were then isolated from the initial 1767 gene-set and those 88 that showed the largest variation
(SD of the gene vector > 55S 4) were used for hierarchical clustering of the tumor samples. This
gene-set of only 88 genes was able to identify the clinically relevant groups almost as exactly as the 1767
gene-set (Fig. 1b). This finding emphasizes that the tumor clustering is not simply reflecting larger
amounts of stromal components in the invasive tumor biopsies. The frequently recurring Ta grade 3
tumors clustered two by two in four separate clusters. The four pTa grade 3 tumors surrounded by
CIS were still located inside the invasive branch. One Ta tumor (1166-1) that clustered as a T1
tumor using 1767 genes repeated this position with the small 88 gene-set. It cannot be ruled out that
this tumor in reality is a T1 grade 3 tumor.

Gene clustering

Hierarchical cluster analysis of the 1767 genes revealed several characteristic profiles in which 
there was a distinct difference between the tumor groups (Fig. 1d, black lines identifying clusters A
to J). 

Cluster A contains genes that show low expression in normal urothelium and stage T1 tumors, a
medium level in stage T2 and a very high level in all the Ta grade 3 tumors (Fig. 2a). This cluster
contains 8 transcription factors as well as other nuclear genes related to transcriptional activity (see
Supplementary Information; Figure 1 & 2 for enlarged views of cluster A-J). The high
transcriptional activity may be related to both a high metabolic activity and an increased cell
proliferation. Although not identical to the distribution of the proliferation cluster (cluster C),
these two clusters show a high degree of similarity.

In Cluster B a high level of expression is seen in Ta grade 3 tumors with frequent recurrences and
with CIS but not in the more indolent Ta grade 2 tumors. This cluster contains 11 genes that encode
nuclear proteins, such as alpha polymerase, RAD 21, Rbl and topoisomerase II binding protein.

Cluster C contains genes that are up-regulated in both Ta grade 3 with high recurrence rate and CIS,
in T2 muscle invasive tumors and in half of the T1 tumors. This cluster shows a remarkably tight
co-regulation of genes related to cell cycle control and mitosis (Fig. 2c). Cyclins, PCNA as well as a
number of centromere-related proteins are represented in this cluster.

Cluster D holds genes that show a lower than normal expression in muscle invasive stage T2 tumors
and Ta grade 3 tumors with CIS, and relatively higher expression in Ta grade 3 and T1 tumors.
Some interesting genes in this cluster are keratin 8 and 19, E-cadherin, integrin beta 4 and beta 6,
and the EGF-related genes erb-B2, erb-B3 and EGF receptor pathway substrate 8.

Cluster E holds genes that have a very high expression in Ta grade 2 and 3 without CIS. Among
those we find two homeobox proteins A1 and A5, an insulin-like growth factor receptor and von
Hippel-Lindau syndrome protein, as well as an NGF-inducible anti-proliferative protein.

Cluster F shows a tight cluster of genes related to keratinization (Fig. 3). Only two tumor samples
(875-1 and 1178-1) show a very high expression of these genes, which include keratins 6A, 6B,
14, 16, 17, and small proline-rich proteins 1A and B and 2A and B. A re-evaluation of the pathology slides
revealed that only the two samples with high levels of these genes had epidermoid metaplasia.
Thus, this cluster of genes explains the gene activation leading to squamous metaplasia as
frequently seen by light microscopy in invasive bladder tumors.

Cluster G holds genes that are up-regulated in T2 tumors and have a remarkably consistent high 
expression level in the Ta grade 3 tumors with CIS that cluster in the invasive branch (Fig. 2g). The
cluster is characterized by high levels of genes related to the stroma such as laminin, myosin,
caldesmon, collagen, dystrophin, fibronectin, and endoglin. The increased transcription of these
genes may indicate a remodeling of the stroma that could reflect signaling from the tumor cells
(connective tissue growth factor is included in the cluster) or from infiltrating lymphocytes. It is
remarkable that these genes are those that most clearly separate the Ta grade 3 tumors surrounded
by CIS from all other Ta grade 3 tumors.

Cluster H is seen as a continuation of cluster G, and like it contains a number of stroma-related
genes such as myosin, tropomyosin, decorin, procollagen and collagens. The prevalence in this cluster
of highly expressed genes in both normal biopsies and invasive tumors could indicate that this
cluster reflects the amount of stroma in the biopsy, as stroma is generally more richly represented
in those biopsies.

Cluster I includes genes that are lower in expression in T1 and Ta tumors than in normal urothelium
and invasive tumors. It contains a large number of genes related to the immune system such
as MHC genes, interleukin receptors, and immunoglobulins. It could be regarded as a measure of
the immune response against the tumor; however, the normal biopsies and the muscle invasive
tumors look very much alike, indicating that it might be a reflection of the amount of stroma in the
biopsy. As the level is low in papillomas it cannot be ruled out that papillomas show a reduced
immune response for some unknown reason. However, that has to be tested by microdissection
approaches, if that can be done without reducing the RNA quality.

Cluster J includes genes that are highly expressed in invasive tumors, and to some extent in Ta
grade 3 with CIS. It contains protease-related genes such as matrix metalloproteinase 2 and 9,
plasminogen activator urokinase receptor, and urokinase, as well as the cytokine-related genes TNF
alpha induced proteins 3 and 6, IL6 and CSF 1, and finally the GRO2 and GRO3 oncogenes. We hypothesize
that this cluster is related to the invasive process; however, it is remarkable that the Ta grade 3
tumors with CIS have such a high matrix degrading activity, as these tumors have not yet passed the
basal membrane. One might suggest that this activity favors breakdown of the basal membrane
as well as a fast invasive process once the tumor cells pass through it. Seen in this light, this
cluster may explain why patients with CIS lesions have such a poor prognosis.

Prediction of bladder tumor stages, generation of a classifier. 

An objective class prediction of bladder tumors based on a limited gene-set would be desirable, and 
could be of potential clinical use. We decided to build a classifier using tumors correctly classified 
in the three main groups as identified in the cluster dendrogram (Fig. 1a). Consequently, the
classifier is based on expression patterns rather than pathological staging.
We used a maximum likelihood classification method with a cross-validation scheme where one
test tumor was removed from the set and a set of predictive genes was selected from the remaining
tumor samples for classifying the test tumor. This process was repeated for all tumors. Predictive
genes that showed the largest possible separation of the three groups were selected for
classification, and each tumor was classified according to how close it was to the mean in the three
groups (Fig. 4). We classified tumor samples using predictive gene-sets ranging from 10 to 320
genes (Supplementary Information; Table 2). Classification using 80 predictive genes showed the
best correlation to pathological staging; more or fewer predictive genes included in the classifier
distorted the correlation (Table 1).
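The leave-one-out scheme described above can be sketched as follows. The point it illustrates is that the predictive genes are re-selected inside every fold from the remaining tumors only, so the test tumor never influences gene selection. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-gene variance is assumed to be the pooled within-group variance, genes are ranked by the ratio of between-group to within-group sums of squares, and the 5% difference limit mentioned in the text is not modeled.

```python
from statistics import mean


def select_genes(train, labels, n_genes):
    """Rank genes by between-group / within-group variation; keep the top ones."""
    groups = sorted(set(labels))
    scores = []
    for g in range(len(train[0])):
        values = [sample[g] for sample in train]
        overall = mean(values)
        between = within = 0.0
        for k in groups:
            members = [v for v, lab in zip(values, labels) if lab == k]
            centre = mean(members)
            between += len(members) * (centre - overall) ** 2
            within += sum((v - centre) ** 2 for v in members)
        scores.append(between / within if within > 0 else float("inf"))
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:n_genes]


def classify(sample, train, labels, genes):
    """Assign the group whose means are closest in variance-standardized distance."""
    groups = sorted(set(labels))
    distance = dict.fromkeys(groups, 0.0)
    for g in genes:
        by_group = {k: [s[g] for s, lab in zip(train, labels) if lab == k]
                    for k in groups}
        means = {k: mean(vals) for k, vals in by_group.items()}
        # Assumption: one pooled within-group variance per gene, shared by all groups.
        squares = sum((v - means[k]) ** 2
                      for k, vals in by_group.items() for v in vals)
        variance = squares / (len(train) - len(groups))
        if variance == 0:
            continue  # a gene with no within-group spread cannot be standardized
        for k in groups:
            distance[k] += (sample[g] - means[k]) ** 2 / variance
    return min(distance, key=distance.get)


def leave_one_out(samples, labels, n_genes):
    """Classify each sample using genes and estimates from all other samples."""
    predictions = []
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, train_labels, n_genes)
        predictions.append(classify(samples[i], train, train_labels, genes))
    return predictions
```

Here `samples` is a list of per-tumor lists of log-transformed expression ratios and `labels` holds the cluster-derived group of each tumor (Ta, T1 or T2); comparing `leave_one_out(samples, labels, n_genes)` for several values of `n_genes` corresponds to the comparison of gene-set sizes reported in the text.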

Three of the four pTa gr3 tumors with surrounding CIS that clustered as T2+ tumors were classified 
as T2 and one failed the 5% difference limit (Ta/T1). The solid pT1 tumor (1257-1) that clustered
with the muscle invasive tumors was classified as a T1 and the pTa gr3 tumor (1166-1) that
clustered with the T1 tumors was classified as a Ta tumor. However, the muscle invasive pT2+
tumor (937-1) previously found in the T1 cluster was also classified as a T1 tumor. This was also
the case for tumor 1164-1. It is obvious that the T1 tumors were close to both Ta and T2 tumors,
thus forming an intermediate between them (Fig. 4).

Discussion 

In this paper we show that applying hierarchical two-way clustering to very well characterized
clinical specimens can lead to an exact prediction of known as well as new clinically relevant
tumor classes. The specimens were characterized by common pathology features such as stage and grade,
but also by information on surrounding carcinoma in situ and recurrence pattern through several
years. We identified a subset of superficial Ta grade 3 tumors with surrounding carcinoma in situ
that had properties in common with muscle invasive tumors and indeed clustered together with these.
Furthermore, we could distinguish the group of non-recurrent superficial Ta grade 2 tumors from Ta
grade 3 tumors with frequent recurrences.

In each class of tumors we identified clusters of genes suggesting some important properties 
of these classes. For example, we identified a highly increased level of transcription factors in
Ta grade 3 tumors with frequent recurrences. Three of these transcription factors (TFDP1, TFDP2,
and GTF2H4) are involved in cell cycle regulation. In the proliferative cluster that was most
prominent in Ta grade 3 with CIS and muscle invasive tumors it was remarkable to observe the
many genes related to chromosomal segregation in mitosis. Genes like mitotic kinesin-like protein
1, CDC47, mitotic centromere associated kinesin, centromere protein A, E, and F, and kinesin-like
protein 1 all had an up-regulated expression. Whether this is simply reflecting increased cell
proliferation, or relates to the well-known aneuploidy found even at early stages of bladder cancer, is
not known. We do know from a previous study that there are no mutations in the genes related to
the anaphase promoting complex; thus a change in expression of genes related to centromere
function offers an alternative explanation that deserves further exploration. These gene products at
either RNA or protein level could form important new targets for drug therapy, using for example
small molecules that could penetrate the cell membrane and bind to and inhibit these
molecules.

Another important discovery was a cluster of genes related to the stroma and probably 
indicating stromal remodelling. This cluster was by far most up-regulated in pTa grade 3 tumors
with CIS and to almost the same extent in muscle invasive tumors. It contained genes such as laminins,
hexabrachion, fibulin, myosins, caldesmons, dystrophin, endoglin, collagens IV, V, XV and XVIII,
integrins, fibronectin, cadherin, moesin and connective tissue growth factor.

The number of genes used to identify the important clinical classes was originally 1767, but
by sorting out the genes that were oncology related it could be reduced to only 88 genes. Interestingly,
the 88 genes defined three major branches: a Ta, a T1 and a T2 branch. As with the larger number of
genes, the T2 branch included the Ta grade 3 tumors with CIS. These data point to the fact that it
seems possible to classify bladder tumors using a restricted number of genes on a bladder cancer
microarray. The smaller number is needed to avoid too much irrelevant noise, and makes
interpretation much easier.

Encouraged by this finding we decided to test the strength of using our gene set as a classifier
for bladder cancer samples. Instead of using the pathological staging groups directly we used the
three main groups of tumors identified by the cluster analysis. Because of the limited number of
samples in each group we used a cross-validation scheme for classifying the tumors. The obtained
classification results showed large similarities to pathological staging when using 80 predictive
genes. Furthermore, three of the four Ta gr3 tumors with surrounding CIS, which in the cluster
analysis were found close to muscle invasive tumors, were classified as T2 tumors. This is in
agreement with the higher risk of disease progression in these patients. In addition, the two muscle
invasive tumors (937-1 and 1164-1) classified as T1 tumors were from patients who are still alive
after 3 and 2 years, respectively.

It will be interesting in the future to follow up on these patients with the aim of evaluating 
whether the subclasses of Tl and T2 tumors that could be identified hold information on the 
response to treatment. However, it may be more likely that a completely different data set will be
needed to generate markers that will predict treatment response.

A commonly observed phenomenon in muscle invasive bladder cancer is squamous metaplasia. 
Pure squamous cell tumors are relatively rare and have a very poor prognosis with more than 50% of
the patients dying within one year [ref]. The two tumors with squamous metaplasia demonstrated
clearly some of the genes that are activated in this process: keratins 6A, 6B, 14, 16, 17 and small
proline-rich proteins 1A, B and 2B. This corresponds to previous data based on 2-D gels showing
the keratins 6, 14, 16, and 17 highly expressed at the protein level in squamous carcinomas 13 .
Furthermore, the small proline rich proteins are present in squamous tissues 14 . Whether the 
metaplasia is a favorable or unfavorable finding for the disease outcome is not described. 

It was interesting that we did not observe systematic alteration in genes related to apoptosis. 
Reduced apoptosis is supposed to be of major importance in the malignant process as demonstrated 
in xx cancer by alterations of yy apoptosis related proteins. However, very few apoptosis related 
genes showed changes in the bladder tumors and none of these in a systematic way. Whether this 
indicates that apoptosis is of relatively less importance in bladder cancer or that apoptosis is blocked
due to inactivating mutations cannot be answered based on the present data. It also emphasizes the
fact that we are only registering the level of transcripts by using microarrays. We obtain no
information on the quality of the transcripts. These may be harboring inactivating mutations or may
be splice variants without biological function; this aspect should always be borne in mind when
interpreting microarray data.

Previous publications have demonstrated the difference between benign and malignant disease 
e.g. in the prostate and in the breast. However, this is the first paper to utilize cluster analysis to 
identify new important classes in a common epithelial carcinoma disease. This was only possible 
due to the very well characterized clinical material, and it revitalizes the notion that although we have
highly sophisticated technologies at hand now, it is still of the utmost importance, and maybe even
more important now when thousands of data points are obtained from one specimen, that the quality of the
specimens to be analyzed is superior.

The very precise class prediction obtained by hierarchical cluster analysis in the present paper 
is remarkable when taking into account the complete lack of clustering according to stage and grade 
in clear cell renal carcinomas as recently published 11 . In prostate and colon cancer it was possible to
separate benign and malignant diseases 15,16 ; however, more detailed classification of samples taking
into account the disease course and, in colon, the Dukes stages is yet to come.

We are now able to identify gene clusters that can be used to classify bladder tumors, not only 
to existing stages and grades but also taking into account surrounding carcinoma in situ and the 
recurrence pattern. Fabrication of microarrays with the purpose of stratifying patients for specific 
treatment options is now a possibility. 

Methods 

Biological material. 40 bladder tumor biopsies were sampled from patients following removal of 
the necessary amount of tissue for routine pathology examination. The tumors were frozen 
immediately after surgery and stored at -80°C in a guanidinium thiocyanate solution. All tumors
were graded according to Bergkvist et al. 17 and re-evaluated by a single pathologist. As normal
urothelial reference samples we used a pool of biopsies as well as three single biopsies from
patients with prostatic hyperplasia or urinary incontinence. Informed consent was obtained in all
cases and protocols were approved by the local scientific ethical committee.

RNA purification and cRNA preparation. Total RNA was isolated from crude tumor biopsies
using a Polytron homogenizer and the RNAzol B RNA isolation method (WAK-Chemie Medical
GmbH). 10 ug total RNA was used as starting material for the cDNA preparation. The first and
second strand cDNA synthesis was performed using the SuperScript Choice System (Life
Technologies) according to the manufacturer's instructions, except using an oligo-dT primer containing
a T7 RNA polymerase promoter site. Labeled cRNA was prepared using the BioArray High Yield
RNA Transcript Labeling Kit (Enzo). Biotin-labeled CTP and UTP (Enzo) were used in the reaction
together with unlabeled NTPs. Following the IVT reaction, the unincorporated nucleotides were
removed using RNeasy columns (Qiagen).

Array hybridization and scanning. 15 ug of cRNA was fragmented at 94°C for 35 min in a 
fragmentation buffer containing 40 mM Tris-acetate pH 8.1, 100 mM KOAc, 30 mM MgOAc. Prior 
to hybridization, the fragmented cRNA in a 6xSSPE-T hybridization buffer (1 M NaCl, 10 mM Tris 
pH 7.6, 0.005% Triton), was heated to 95°C for 5 min and subsequently to 40°C for 5 min before 
loading onto the Affymetrix probe array cartridge. The probe array was then incubated for 16 h at 
45°C at constant rotation (60 rpm). The washing and staining procedure was performed in the 
Affymetrix Fluidics Station. The probe array was exposed to 10 washes in 6xSSPE-T at 25°C 
followed by 4 washes in 0.5xSSPE-T at 50°C. The biotinylated cRNA was stained with a streptavidin- 
phycoerythrin conjugate, final concentration 2 ug/ul (Molecular Probes, Eugene, OR) in 6xSSPE-T 
for 30 min at 25°C followed by 10 washes in 6xSSPE-T at 25°C. An antibody amplification step was 
added using normal goat IgG final concentration 0.1 mg/ml (Sigma) and Anti-streptavidin antibody 
(goat) biotinylated final concentration 3 ug/ml (Vector Laboratories). This was followed by a staining 
step with a streptavidin-phycoerythrin conjugate, final concentration 2 ug/ul (Molecular Probes,
Eugene, OR) in 6xSSPE-T for 30 min at 25°C and 10 washes in 6xSSPE-T at 25°C. 
The probe arrays were scanned at 560 nm using a confocal laser-scanning microscope with an argon
ion laser as the excitation source (Hewlett Packard GeneArray Scanner G2500A). The readings from
the quantitative scanning were analysed by the Affymetrix Gene Expression Analysis Software.

Data analysis. All chips were scaled to a global intensity of 150 units. Expression level ratios
between tumors and the normal urothelium reference pool were calculated using the comparison
analysis implemented in the Affymetrix GeneChip software. In order to avoid expression ratios
based on saturated gene-probes we used the antibody amplified chip-data for genes with an average
AvgDiff value below 1000 and the non-amplified data for genes with an average AvgDiff value
equal to or above 1000. We applied different filtering criteria to the expression data in order to
avoid including non-varying and non-measurable genes in the data analysis. First, only genes
that showed significant changes ("Increase" or "Decrease" calls) in expression levels compared to
the normal reference pool in at least three samples were selected. Second, only genes with at least
three "Present" calls across all experimental samples were selected. Third, we removed genes
varying less than 2 standard deviations across all samples. The final gene-set contained 1767 genes
following filtering. Two-way hierarchical agglomerative cluster analysis was performed using the
GeneCluster software 18 . We used average linkage clustering with a modified Pearson correlation as
similarity metric. Genes and arrays were median centered and normalized to the magnitude of 1
prior to cluster analysis. The TreeView software was used for visualization of the cluster analysis
results 18 . Multidimensional scaling was performed on median centered and normalized data using
an implementation in the SPSS statistical software package.
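The three filtering criteria listed above can be expressed as a single predicate per gene. The sketch below is illustrative only. The call strings "Increase", "Decrease" and "Present" are taken from the text; the data layout and function names are hypothetical; and the third criterion is worded ambiguously in the text, so it is modeled here, as an explicit assumption, as a minimum standard deviation of the gene's values across samples (`min_sd`).

```python
from statistics import stdev

CHANGE_CALLS = ("Increase", "Decrease")


def passes_filters(change_calls, detection_calls, values,
                   min_changed=3, min_present=3, min_sd=2.0):
    """Apply the three criteria to one gene.

    change_calls    : per-sample comparison calls against the reference pool
    detection_calls : per-sample detection calls
    values          : per-sample expression values
    min_sd          : assumed reading of the third criterion (see lead-in)
    """
    n_changed = sum(call in CHANGE_CALLS for call in change_calls)
    n_present = sum(call == "Present" for call in detection_calls)
    return (n_changed >= min_changed
            and n_present >= min_present
            and stdev(values) >= min_sd)


def filter_genes(genes, **thresholds):
    """genes maps a gene name to (change_calls, detection_calls, values)."""
    return [name for name, (change, detection, values) in genes.items()
            if passes_filters(change, detection, values, **thresholds)]
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive the size of the final gene-set is to each criterion.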

Maximum likelihood classifier

We based the classifier on the log-transformed expression level ratios. For these transformed values 
we used a normal distribution with the mean dependent on the gene and the group (Ta, T1, and T2,
respectively) and the variance dependent on the gene only. To classify a sample we calculate the
sum over the genes of the squared distance from the sample value to the group mean, standardized
by the variance. Thus we get a distance to each of the three groups and the sample is classified as
belonging to the group where the distance is smallest. When calculating these distances the group
means and the variances are estimated from all the samples in the training set excluding the sample
being classified. When using a subset of the genes for classification we calculate for each gene the
ratio of the variation between the groups to the variation within the groups and select those genes
with a high value of this ratio (reference to Dudoit, Fridlyand and Speed). As with any classifier, the
classifier here can be criticized for being based on a model that is only partly correct. In particular,
the model does not take into account the correlation among the genes (whether of biological origin
or due to artifacts in the data processing). However, some important aspects of the data seem to
be captured, allowing for a successful classifier.
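In symbols, the description above can be written as follows. The notation is ours and is introduced only for illustration: x_g is the log-transformed ratio of gene g in the sample to be classified, G is the set of selected genes, mu-hat and sigma-hat squared are the estimated group mean and gene variance, and n_k is the number of training samples in group k. The exact form of the between/within ratio is an assumption, written here as the usual ratio of between-group to within-group sums of squares.

```latex
% Distance from a sample x to group k, and the resulting class assignment
\[
  d_k(x) = \sum_{g \in G} \frac{\left(x_g - \hat{\mu}_{gk}\right)^2}{\hat{\sigma}_g^2},
  \qquad
  \hat{k}(x) = \arg\min_{k \in \{\mathrm{Ta},\,\mathrm{T1},\,\mathrm{T2}\}} d_k(x)
\]

% Gene selection score: between-group over within-group variation for gene g
\[
  R_g = \frac{\sum_{k} n_k \left(\bar{x}_{gk} - \bar{x}_{g}\right)^2}
             {\sum_{k} \sum_{i \in k} \left(x_{gi} - \bar{x}_{gk}\right)^2}
\]
```

Because the variance depends on the gene only, the normalizing constant of the normal density is the same for all three groups, so the log-likelihood of group k equals a constant minus d_k(x)/2. Choosing the smallest distance is therefore the same as choosing the group with the largest likelihood, provided the groups are treated as equally probable a priori.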

Parallel gene expression monitoring is a powerful tool for the analysis of relations between
tumours, for discovering new tumour subgroups, for assigning tumours to pre-defined classes, for
identifying co-regulated or tumour stage specific genes, and for predicting outcome. In a recent
study of bladder cancer, we demonstrated functional groups of genes whose co-regulation formed
the basis for separating bladder tumours into superficial and muscle invasive tumours 18 . We now
used microarrays with approximately 5000 full-length genes to analyse gene expression and to
predict tumour classes in 40 bladder tumours selected from a very large clinical specimen bank
holding more than 35,000 samples from bladder cancer patients, prospectively followed for up to
six years. The selection was based on the disease course, stage, grade, concomitant carcinoma in
situ (CIS), and recurrence frequency (number of new tumours per year), in such a way that the
selected tumours represent six different groups of patients covering the spectrum from relatively
harmless superficial non-recurring papillary Ta grade 2 tumours, to submucosa invasive stage T1
tumours, and finally to primarily muscle invasive T2-4 (T2+) tumours (Table 1; see
Supplementary Information Table 1 for the complete disease courses). RNA from tumours and
from 4 normal tissue samples (a pool of biopsies from 37 patients and 3 single biopsies) was
labelled and hybridised to Affymetrix oligonucleotide microarrays. Scanning identified the
expression level of the genes utilising antibody amplification of weakly expressed genes. Genes
that did not vary throughout the data-set, e.g. housekeeping genes, were eliminated, and only the
1767 genes (26 %) that showed an expression level change in tumour tissue compared to normal
urothelium were subjected to cluster analysis.

A two-way hierarchical cluster analysis of the tumour samples based on the 1767 gene-set
remarkably separated all 40 tumours according to conventional pathological stages and grades with
only few exceptions (Fig. 1a). We identified two main branches containing the superficial Ta
tumours, and the invasive T1 and T2+ tumours. In the superficial branch two sub-clusters of
tumours could be identified, one holding 8 tumours that had frequent recurrences and one holding 3
out of the five Ta grade 2 tumours with no recurrences. In the invasive branch, it was notable that
four Ta grade 3 tumours clustered tightly with the muscle invasive T2+ tumours. These four Ta
tumours, from patients with no previous tumour history, showed concomitant CIS in the
surrounding mucosa, indicating that this sub-fraction of Ta tumours has some of the more
aggressive features found in muscle invasive tumours. The stage T1 cluster could be separated into
three sub-clusters with no clear clinical difference. The one stage T1 grade 3 tumour that clustered
with the stage T2+ muscle invasive tumours was the only T1 tumour that showed a solid growth
pattern, all others showing papillary growth. Nine out of ten T2+ tumours were found in one single
cluster. The remarkably distinct separation of the tumour groups according to stage, with practically
no overlap between groups, was also demonstrated by multidimensional scaling analysis (Fig. 1c).

In an attempt to reduce the number of genes needed for class prediction we identified those
genes that were scored by the Cancer Genome Anatomy Project (at NCI) as belonging to
cancer-related groups such as tumour suppressors, oncogenes, cell cycle, etc. These genes were then
selected from the initial 1767 gene-set, and the 88 that showed the largest variation (SD of the gene
vector >=4) were used for hierarchical clustering of the tumour samples. The obtained clusters were
almost identical to the 1767 gene-set cluster dendrogram (Fig. 1b), indicating that the tumour
clustering does not simply reflect larger amounts of stromal components in the invasive tumour
biopsies.

The clustering of the 1767 genes revealed several characteristic profiles in which there was a
distinct difference between the tumour groups (Fig. 1d; black lines identifying clusters a to j).
Cluster a shows a high expression level in all the Ta grade 3 tumours (Fig. 2a) and, as a novel
finding, contains genes encoding 8 transcription factors as well as other nuclear genes related to
transcriptional activity. Cluster c contains genes that are up-regulated in both Ta grade 3 with high
recurrence rate and CIS, in T2+ and some T1 tumours. This cluster shows a remarkably tight
co-regulation of genes related to cell cycle control and mitosis (Fig. 2c). Genes encoding cyclins,
PCNA as well as a number of centromere related proteins are present in this cluster. They indicate
increased cellular proliferation and may form new targets for small molecule therapy 19. Cluster f
shows a tight cluster of genes related to keratinisation (Fig. 2f). Two tumours (875-1 and 1178-1)
had a very high expression of these genes and a re-evaluation of the pathology slides revealed that
these were the only two samples to show squamous metaplasia. Thus, activation of this cluster of
genes promotes the squamous metaplasia not infrequently seen by light microscopy in invasive
bladder tumours. Cluster g contains genes that are up-regulated in T2+ tumours and in the Ta grade
3 tumours with CIS that cluster in the invasive branch (Fig. 2g). This cluster contains genes related
to angiogenesis and connective tissue such as laminin, myosin, caldesmon, collagen, dystrophin,
fibronectin, and endoglin. The increased transcription of these genes may indicate a profound
remodelling of the stroma that could reflect signalling from the tumour cells, from infiltrating
lymphocytes, or both. Some of these may also form new drug targets 20. It is remarkable that these
genes are those that most clearly separate the Ta grade 3 tumours surrounded by CIS from all other
Ta grade 3 tumours. The presence of adjacent CIS is usually diagnosed by taking a set of eight
biopsies from different places in the bladder mucosa. However, the present data clearly indicate that
analysis of stroma remodelling genes in the Ta tumours could eliminate this invasive procedure.

The clusters b, d, e, h, i, and j contain genes related to nuclear proteins, cell adhesion, 
growth factors, stromal proteins, immune system, and proteases, respectively (see Supplementary 
Information). A summary of the stage related gene expression is shown in Table 2. 

An objective class prediction of bladder tumours based on a limited gene-set is clinically
useful. We therefore built a classifier using tumours correctly separated in the three main groups as
identified in the cluster dendrogram (Fig. 1a). We used a maximum likelihood classification method
with a "leave one out" cross-validation scheme 11,12 in which one test tumour was removed from the
set, and a set of predictive genes was selected from the remaining tumour samples for classifying
the test tumour. This process was repeated for all tumours. Predictive genes that showed the largest
possible separation of the three groups were selected for classification, and each tumour was
classified according to how close it was to the mean of the three groups (Fig. 3). The classifier
performance was tested using from 1-160 genes in cross-validation loops, and a model using an
80-gene cross-validation scheme showed the best correlation to pathologic staging (P on the order
of 10^-9). The 71 genes that were used in at least 75% of the cross-validation loops were selected to
constitute our final classifier model. To test the class separation performance of the 71 selected
genes we compared their performance to that of a permuted set of pseudo-Ta, T1 and T2 tumours.
In 500 permutations we detected only two genes with a performance equal to the poorest performing
classifying genes (for detailed information on the classifier see Supplementary Information).
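The permutation check described above can be illustrated as follows. This is a sketch under assumptions rather than the procedure actually used: the per-gene score is assumed to be the between/within sum-of-squares ratio, the names `bw_score` and `permutation_exceedances` are hypothetical, and `threshold` stands for the score of the poorest gene in the real classifier.

```python
import random


def bw_score(values, labels):
    """Between-group over within-group sum of squares for one gene."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    grand = sum(values) / len(values)
    between = sum(len(vs) * (sum(vs) / len(vs) - grand) ** 2
                  for vs in groups.values())
    within = sum((v - sum(vs) / len(vs)) ** 2
                 for vs in groups.values() for v in vs)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def permutation_exceedances(expr, labels, threshold, n_perm=500, seed=0):
    """Count (permutation, gene) pairs scoring at least `threshold`.

    expr is a list of per-gene value lists; labels are shuffled to form
    pseudo-groups of the same sizes as the real groups.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += sum(bw_score(values, shuffled) >= threshold for values in expr)
    return hits
```

A small count relative to `n_perm * len(expr)` suggests that the real group labels carry information that random pseudo-groups rarely reproduce.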

The classification using 80 predictive genes in cross-validation loops identified the Ta
group with no surrounding CIS and no previous tumour or no previous tumour of a higher stage
(Table 1). Interestingly, the Ta tumours surrounded by CIS that were classified as T2 or T1 clearly
demonstrate the potential of the classification method for identifying surrounding CIS in a
non-invasive way, thereby supplementing clinical and pathologic information.

An objective class prediction of bladder tumours based on a limited gene-set could be of
potential clinical use. We therefore built a maximum likelihood classifier using only those tumours
(35 out of 40) that showed a group specific expression pattern (Web Figure B). The classifier was
evaluated through a "leave one out" cross-validation scheme 11,12. Predictive genes that showed
the largest possible separation of the three groups were selected for classification, and each tumour
was classified according to how close it was to the mean of the three groups (Fig. 3a). The classifier
performance was tested using from 1-200 genes in cross-validation loops, and a model using a
38-gene cross-validation scheme showed the best correlation to pathologic staging (Web Figure C).
The 32 genes that were used in at least 75% (27 times) of the cross validations were selected to
constitute our final classifier model (Web Table B). Interestingly, some of the Ta tumours
surrounded by CIS were classified as T2, thereby supplementing clinical and pathologic
information.
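Selecting the final gene-set from the cross-validation loops amounts to counting how often each gene was chosen. The sketch below is illustrative only; `stable_genes` is a hypothetical name, and the input is assumed to be one list of selected gene identifiers per cross-validation loop.

```python
from collections import Counter


def stable_genes(genes_per_loop, min_fraction=0.75):
    """Return genes selected in at least `min_fraction` of the loops."""
    # set() guards against a gene being counted twice within one loop.
    counts = Counter(g for loop in genes_per_loop for g in set(loop))
    needed = min_fraction * len(genes_per_loop)
    return sorted(g for g, c in counts.items() if c >= needed)
```

With 35 loops the threshold is 26.25, so a gene must appear at least 27 times, which matches the "27 times" quoted in the text.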

We furthermore tested an outcome predictor able to identify the likely presence or absence
of recurrence in patients with superficial Ta tumours (see Web Table E for patient disease courses).
The optimal number of genes in cross-validation loops was found to be 39 (75% of the samples
were correctly classified, p<0.006; Web Figure G; Web Table F) and from this we selected those 26
genes (Figure 3b) that were used in at least 75% of the cross-validation loops to constitute our final
recurrence predictor. Consequently, this set of genes is to be used for predicting recurrence in
independent samples. We tested the strength of the predictive genes by permutation analysis (Web
Table G).

We present data on expression patterns that classify the benign and muscle-invasive bladder
carcinomas. Furthermore, we can identify subgroups of bladder cancer such as Ta tumours with
surrounding CIS, Ta tumours with a high probability of progression as well as recurrence, and T2
tumours with squamous metaplasia. As a novel finding, the matrix remodelling gene cluster was
specifically expressed in the tumours having the worst prognosis, namely the T2 tumours and
tumours surrounded by CIS. For some of these genes new small molecule inhibitors already exist 22,
and thus they form drug targets. At present it is not possible clinically to identify which patients will
experience recurrence and which will not, but doing so would be of great benefit to both the patients
and the health system by reducing the number of unnecessary control examinations in bladder tumour
patients. To determine the optimal gene-set for separating non-recurrent and recurrent tumours, we
again applied a cross-validation scheme using from 1-200 genes. We determined the optimal
number of genes in cross-validation loops to be 39 (75% of the samples were correctly classified,
p<0.01) and from this we selected those 26 genes (Figure 4) that were used in at least 75% of the
cross-validation loops to constitute our final recurrence predictor. Consequently, this set of genes is
to be used for predicting recurrence in independent samples. We tested the strength of the predictive
genes by performing 500 permutations of the arrays. This revealed that for most of our predictive
genes we would only in a small number of the new pseudo-groups obtain at least as good predictors
as in the real groups (see further details in Supplementary Information).

We present data on expression patterns that classify the different well-known clinical stages
of bladder carcinoma. Furthermore, we can classify subgroups of bladder cancers such as Ta
tumours with surrounding CIS, Ta tumours with recurrence potential, and T2 tumours with
squamous metaplasia. This has implications for epithelial cancers in general as these may be
subdivided into a larger number of subclasses than has previously been expected, due to the
sensitive way in which microarrays detect even minor tumour variations. As a novel finding, the
matrix remodelling gene cluster was specifically expressed in the tumours having the worst
prognosis, namely the T2 tumours and tumours surrounded by CIS. Furthermore, another novel
distinct molecular feature was the high expression of transcription related genes in Ta tumours.

The ability to classify bladder tumours, to identify Ta tumours that will recur and to make a 
non-invasive diagnosis of CIS in the bladder is of immediate clinical relevance. In a larger 
perspective many of the differentially expressed genes form new drug targets, e.g. the matrix 
remodelling related genes, for some of which new small molecule inhibitors already exist 22 . 

Methods 

Biological material. 66 bladder tumour biopsies were sampled from patients following removal of
the necessary amount of tissue for routine pathology examination. The tumours were frozen
immediately after surgery and stored at -80°C in a guanidinium thiocyanate solution. All tumours
were graded according to Bergkvist et al. 23 and re-evaluated by a single pathologist. As normal
urothelial reference samples we used a pool of biopsies (from 37 patients) as well as three single
bladder biopsies from patients with prostatic hyperplasia or urinary incontinence. Informed consent
was obtained in all cases and protocols were approved by the local scientific ethical committee.
cRNA preparation, GeneChip hybridisation and scanning. Target cRNAs were synthesised and
hybridised to Affymetrix GeneChip Hu6800 oligonucleotide microarrays as recommended. See
Supplementary Information for detailed descriptions.

Class discovery using hierarchical clustering. All microarray results were scaled to a global
intensity of 150 units using the Affymetrix GeneChip software. Other ways of array normalisation
exist 24; however, using the dCHIP approach did not change the expression profiles of the obtained
classifier genes in this study (results not shown). For hierarchical cluster analysis and molecular
classification procedures we used expression level ratios between tumours and the normal
urothelium reference pool calculated using the comparison analysis implemented in the Affymetrix
GeneChip software. In order to avoid expression ratios based on saturated gene-probes, we used the
antibody amplified expression-data for genes with a mean Average Difference value across all
samples below 1000 and the non-amplified expression-data for genes with values equal to or above
1000 in mean Average Difference value across all samples. Consequently, gene expression levels
across all samples were either from the amplified or the non-amplified expression-data. We applied
different filtering criteria to the expression data in order to avoid including non-varying genes and
genes with very low expression in the data analysis. Firstly, we selected only genes that showed
significant changes in expression levels compared to the normal reference pool in at least three
samples. Secondly, only genes with at least three "Present" calls across all samples were selected.
Thirdly, we eliminated genes varying less than 2 standard deviations across all samples. The final
gene-set contained 1767 genes following filtering. Two-way hierarchical agglomerative cluster
analysis was performed using the Cluster software 25. We used average linkage clustering with a
modified Pearson correlation as similarity metric. Genes and arrays were median centred and
normalised to the magnitude of 1 prior to cluster analysis. The TreeView software was used for
visualisation of the cluster analysis results 25. Multidimensional scaling was performed on median
centred and normalised data using an implementation in the SPSS statistical software package.
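The pre-processing applied before clustering (median centring followed by scaling each vector to magnitude 1) and a correlation-based similarity can be sketched as below. The helper names are hypothetical, and the plain centred Pearson correlation is used here as a stand-in, since the exact form of the "modified" Pearson correlation in the Cluster software is not specified in the text.

```python
import math
from statistics import median


def median_centre(vector):
    """Subtract the median so the vector is centred on zero."""
    m = median(vector)
    return [x - m for x in vector]


def unit_norm(vector):
    """Scale the vector so its sum of squares (magnitude) equals 1."""
    magnitude = math.sqrt(sum(x * x for x in vector))
    return [x / magnitude for x in vector] if magnitude > 0 else list(vector)


def pearson(a, b):
    """Centred Pearson correlation; returns 0.0 for a constant vector."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b) if sd_a > 0 and sd_b > 0 else 0.0
```

In an average linkage scheme, `1 - pearson(a, b)` would be one natural choice of distance between two gene or array vectors.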
Tumour stage classifier. We based the classifier on the log-transformed expression level ratios.
For these transformed values we used a normal distribution with the mean dependent on the gene
and the group (Ta, T1, and T2, respectively) and the variance dependent on the gene only. For each
gene we calculated the ratio of the variation between the groups to the variation within the groups,
and selected those genes with a high ratio value. To classify a sample, we calculated the sum over
the genes of the squared distance from the sample value to the group mean, standardised by the
variance. Thus, we got a distance to each of the three groups and the sample was classified as
belonging to the group in which the distance was smallest. When calculating these distances the
group means and the variances were estimated from all the samples in the training set excluding the
sample being classified.
Recurrence prediction using a supervised learning method. Average Difference values were
generated using the Affymetrix GeneChip software and all values below 20 were set to 20 to avoid
very low and negative numbers. We only included genes that had a "Present" call in at least 7
samples and that showed intensity variation (Max-Min>100, Max/Min>2). The values were
log transformed and rescaled. We used a supervised learning method essentially as described 11.
Genes were selected using t-test statistics, and cross-validation and sample classification were
performed as described above.
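The filtering steps for the recurrence predictor can be sketched as follows. This is an illustration under assumptions: `preprocess` is a hypothetical name, the variation filter is applied to the floored values, the natural logarithm is used because the base is not stated, and the unspecified rescaling step is left out.

```python
import math


def preprocess(avg_diff, present_calls, floor=20.0, min_present=7):
    """Floor, filter and log-transform Average Difference values.

    avg_diff maps gene -> list of values across samples;
    present_calls maps gene -> number of samples with a "Present" call.
    """
    kept = {}
    for gene, values in avg_diff.items():
        floored = [max(v, floor) for v in values]  # values below 20 become 20
        hi, lo = max(floored), min(floored)
        varies = (hi - lo > 100) and (hi / lo > 2)
        if present_calls.get(gene, 0) >= min_present and varies:
            kept[gene] = [math.log(v) for v in floored]
    return kept
```

Because every value is floored at 20 before the ratio is taken, `Max/Min` is always defined and positive.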

Immunohistochemistry. Tumour tissue microarrays were prepared essentially as described 26, with
four representative 0.6 mm paraffin cores from each study case. Immunohistochemical staining was
performed using standard highly sensitive techniques after appropriate heat-induced antigen
retrieval. Primary polyclonal goat antibodies against Smad 6 (S-20) and cyclin G2 (N-19) were
from Santa Cruz Biotechnology, Santa Cruz, CA. Antibodies to p53 (monoclonal DO-7) and Her-2
(polyclonal anti-c-erbB-2) were from Dako A/S, Glostrup, Denmark. Ki-67 monoclonal antibody
(MIB1) was from Novocastra Laboratories Ltd, Newcastle-upon-Tyne, UK.


RNA purification and cRNA preparation. Total RNA was isolated from crude tumour biopsies
using a Polytron homogeniser and the RNAzol B RNA isolation method (WAK-Chemie Medical
GmbH). 10 µg total RNA was used as starting material for the cDNA preparation. The first and
second strand cDNA synthesis was performed using the Superscript Choice System (Life
Technologies) according to the manufacturer's instructions except using an oligo-dT primer containing
a T7 RNA polymerase promoter site. Labelled cRNA was prepared using the BioArray High Yield
RNA Transcript Labelling Kit (Enzo). Biotin labelled CTP and UTP (Enzo) were used in the reaction
together with unlabelled NTPs. Following the IVT reaction, the unincorporated nucleotides were
removed using RNeasy columns (Qiagen).

Array hybridisation and scanning. 15 µg of cRNA was fragmented at 94°C for 35 min in a
fragmentation buffer containing 40 mM Tris-acetate pH 8.1, 100 mM KOAc, 30 mM MgOAc. Prior
to hybridisation, the fragmented cRNA in a 6xSSPE-T hybridisation buffer (1 M NaCl, 10 mM Tris
pH 7.6, 0.005% Triton) was heated to 95°C for 5 min and subsequently to 45°C for 5 min before
loading onto the Affymetrix probe array cartridge (HuGeneFL). The probe array was then incubated
for 16 h at 45°C at constant rotation (60 rpm). The washing and staining procedure was performed
in the Affymetrix Fluidics Station. The probe array was exposed to 10 washes in 6xSSPE-T at 25°C
followed by 4 washes in 0.5xSSPE-T at 50°C. The biotinylated cRNA was stained with a
streptavidin-phycoerythrin conjugate, final concentration 2 µg/µl (Molecular Probes, Eugene, OR)
in 6xSSPE-T for 30 min at 25°C followed by 10 washes in 6xSSPE-T at 25°C. The probe arrays
were scanned at 560 nm using a confocal laser-scanning microscope (Hewlett Packard GeneArray
Scanner G2500A). The readings from the quantitative scanning were analysed by the Affymetrix
Gene Expression Analysis Software. An antibody amplification step followed using normal goat
IgG as blocking reagent, final concentration 0.1 mg/ml (Sigma) and biotinylated anti-streptavidin
antibody (goat), final concentration 3 µg/ml (Vector Laboratories). This was followed by a staining
step with a streptavidin-phycoerythrin conjugate, final concentration 2 µg/µl (Molecular Probes,
Eugene, OR) in 6xSSPE-T for 30 min at 25°C and 10 washes in 6xSSPE-T at 25°C. The arrays
were then subjected to a second scan under similar conditions as described above.
Tumour stage classifier. We based the classifier on the log-transformed expression level ratios.
For these transformed values we used a normal distribution with the mean dependent on the gene
and the group (Ta, T1, and T2, respectively) and the variance dependent on the gene only. For each
gene we calculated the variation within the groups (W) and the three variations between two groups
(B(Ta/T1), B(Ta/T2), B(T1/T2)) and used the three ratios B/W to select genes. We selected those
genes having a high value of B(Ta/T1)/W, those genes having a high value of B(Ta/T2)/W, and
those genes with a high value of B(T1/T2)/W. To classify a sample, we calculated the sum over the
genes of the squared distance from the sample value to the group mean, standardised by the
variance. Thus, we got a distance to each of the three groups and the sample was classified as
belonging to the group in which the distance was smallest. When calculating these distances the
group means and the variances were estimated from all the samples in the training set excluding the
sample being classified.
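The pairwise selection rule can be illustrated with the sketch below. The text does not define B for a pair of groups, so the squared difference of the two group means is used here as an assumption, with W taken as the pooled within-group variance; `pairwise_bw` and `select_genes` are hypothetical names.

```python
from itertools import combinations


def pairwise_bw(values, labels):
    """Return B/W for every pair of groups for a single gene."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    means = {g: sum(vs) / len(vs) for g, vs in groups.items()}
    dof = len(values) - len(groups)  # assumes more samples than groups
    within = sum((v - means[g]) ** 2 for v, g in zip(values, labels)) / dof
    within = max(within, 1e-12)  # guard against a zero variance
    return {(a, b): (means[a] - means[b]) ** 2 / within
            for a, b in combinations(sorted(groups), 2)}


def select_genes(expr, labels, top_n):
    """Union of the top_n genes for each pair of groups.

    expr maps gene -> list of values across samples.
    """
    scores = {gene: pairwise_bw(values, labels) for gene, values in expr.items()}
    pairs = list(next(iter(scores.values())))
    chosen = set()
    for pair in pairs:
        ranked = sorted(scores, key=lambda g: scores[g][pair], reverse=True)
        chosen.update(ranked[:top_n])
    return chosen
```

Ranking each pair separately keeps genes that separate only two of the three stages, which a single overall B/W ratio could rank lower.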

Validation of the tumour stage classifier. The performance of the classifier was validated using
another set of bladder tumour expression data obtained from customised oligonucleotide Affymetrix
GeneChips carrying PM probes only. First, we translated all accession numbers on both
oligonucleotide microarrays into UG-clusters and selected those gene-probes present on both arrays
(4416 probe-sets). To make comparisons between the two microarray types we used only the PM
probe values from the original data set. We rescaled all the log (average PM) values and used the
pool of normal bladder biopsies from 37 patients, which was analysed on both array platforms, to
calculate log fold-change expression values. We recalculated the group means and the variances for
each gene used in the classifier and based the classification on 29 genes from the optimal classifier
in the cross-validation step for the original dataset. For the new samples the distances to each of the
three groups were calculated and the sample was classified as belonging to the group for which the
distance was smallest.


Immunohistochemistry. Tumour tissue microarrays were prepared essentially as described 26, with
four representative 0.6 mm paraffin cores from each study case. Immunohistochemical staining was
performed using standard highly sensitive techniques after appropriate heat-induced antigen
retrieval. Primary polyclonal goat antibodies against Smad 6 (S-20) and cyclin G2 (N-19) were
from Santa Cruz Biotechnology. Antibodies to p53 (monoclonal DO-7) and Her-2 (polyclonal
anti-c-erbB-2) were from Dako A/S. Ki-67 monoclonal antibody (MIB1) was from Novocastra
Laboratories Ltd. Staining intensity was scored at four levels (Negative, Weak, Moderate and
Strong) by an experienced pathologist who considered both colour intensity and number of stained
cells, and who was unaware of array results.
Table 1 [table body not recoverable from the source extraction]. Legible row-group headings:
Ta grade III tumours - a prior T1 tumour and CIS in selected site biopsies
T1 grade III tumours - no prior muscle invasive tumour
T2+ grade III/IV tumours - only primary tumours

a Examples of tumour histology. 
b Carcinoma in situ detected in selected site biopsies at the time of sampling tumour tissue for the 
arrays or at previous or subsequent visits. 

c All tumours were reviewed by a single uro-pathologist and any change compared to the routine 
classification is listed. 

d Molecular classification based on 320, 80, and 20 genes cross-validation loops. 



Table 2 • Summary of stage related gene expression

Functional gene clusters a

Tumour stage   Transcription  Nuclear processes  Proliferation  Matrix remodelling  Extracellular matrix  Immune system
Ta gr3 + CIS   ↑↑↑            ↑↑                 ↑↑↑            ↑↑↑                 ↑                     ↑

[Entries for the rows Ta gr2, Ta gr3, T1 gr3 and T2 gr3 could not be assigned to columns from the
source extraction.]



a For a detailed description of gene clusters see Supplementary Information page 6. 
b An increase in gene expression was only found in about half of the samples analysed. 




Figure legends 

Fig. 1 Two-way hierarchical clustering and multidimensional scaling analysis of gene expression
data from 40 bladder tumour biopsies. a, Tumour cluster dendrogram based on the 1767 gene-set.
CIS annotations following the sample names indicate concomitant carcinoma in situ. Tumour
recurrence rates are shown to the right of the dendrogram as + and ++ indicating moderate and high
recurrence rates, respectively, while no sign indicates no or moderate recurrence. b, Tumour cluster
dendrogram based on 88 cancer related genes. c, 2D plot of multidimensional scaling analysis of the
40 tumours based on the 1767 gene-set. The colour code identifies the tumour samples from the
cluster dendrogram (Fig. 1a). d, Two-way cluster analysis diagram of the 1767 gene-set. Each row
in the diagram represents a gene and each column a tumour sample. The colour saturation
represents differences in gene expression across the tumour samples; yellow indicates higher
expression of the gene compared to the median expression (black) and blue indicates lower
expression of the gene compared to the median expression. The colour intensities indicate degrees
of gene-regulation. The sidebars to the right of the diagram represent gene clusters a-j. Normal
1-3 on the left side indicate the three single normal biopsies and normal 4 indicates the pool of
biopsies from 37 patients.

Fig. 2 Enlarged view of the gene clusters a, c, f, and g. The dendrogram at the top is identical to
Fig. 1a. a, Cluster of transcription factors and other nuclear associated genes. c, Cluster of genes
involved in proliferation and cell cycle control. f, Gene expression pattern and corresponding area
with squamous metaplasia in urothelial carcinoma. The yellow colour indicates genes up-regulated
in samples 1178-1 and 875-1, the only two samples with squamous cell metaplasia. g, Cluster of
genes involved in angiogenesis and matrix remodelling.

Fig. 3 Molecular classification of tumour samples using 80 predictive genes in each cross-validation 
loop. Each classification is based on the closeness to the mean in the three classes. Samples marked 
with * were not used to build the classifier. The scale indicates the distance from the samples to the 
classes in the classifier, measured in weighted squared Euclidean distance. 

Fig. 4 Gene expression patterns of the 26 genes that we found to be optimal for prediction of 
superficial tumour recurrence. The best predictors of recurrence are listed at the top and bottom of 
the diagram. For each gene the number of times it was used in the 31 cross-validation loops is listed 
to the right together with the unigene-cluster number (see more details in Supplementary 
Information). 



Supplementary Information 

Identifying distinct classes of bladder carcinoma using microarrays. 

Lars Dyrskjøt Andersen, Thomas Thykjaer, Mogens Kruhøffer, Jens Ledet Jensen, Niels 
Marcussen, Stephen Hamilton-Dutoit, Hans Wolf & Torben F. Ørntoft 

Methods 

The following paragraphs contain supplementary information about cRNA preparation, 
chip hybridisation and scanning protocols not described in the paper. 

RNA purification and cRNA preparation 

Total RNA was isolated from crude tumour biopsies using a Polytron homogeniser and 
the RNAzol B RNA isolation method (WAK-Chemie Medical GmbH). 10 µg total RNA was 
used as starting material for the cDNA preparation. The first and second strand cDNA 
synthesis was performed using the Superscript Choice System (Life Technologies) 
according to the manufacturer's instructions, except that an oligo-dT primer containing a T7 
RNA polymerase promoter site was used. Labelled cRNA was prepared using the BioArray High Yield 
RNA Transcript Labelling Kit (Enzo). Biotin labelled CTP and UTP (Enzo) were used in the 
reaction together with unlabelled NTPs. Following the IVT reaction, the unincorporated 
nucleotides were removed using RNeasy columns (Qiagen). 

Array hybridisation and scanning 

15 µg of cRNA was fragmented at 94°C for 35 min in a fragmentation buffer containing 40 
mM Tris-acetate pH 8.1, 100 mM KOAc, 30 mM MgOAc. Prior to hybridisation, the 
fragmented cRNA in a 6xSSPE-T hybridisation buffer (1 M NaCl, 10 mM Tris pH 7.6, 
0.005% Triton) was heated to 95°C for 5 min and subsequently to 45°C for 5 min before 
loading onto the Affymetrix probe array cartridge. The probe array was then incubated for 
16 h at 45°C at constant rotation (60 rpm). The washing and staining procedure was 
performed in the Affymetrix Fluidics Station. The probe array was exposed to 10 washes in 
6xSSPE-T at 25°C followed by 4 washes in 0.5xSSPE-T at 50°C. The biotinylated cRNA 
was stained with a streptavidin-phycoerythrin conjugate, final concentration 2 µg/µl 
(Molecular Probes, Eugene, OR) in 6xSSPE-T for 30 min at 25°C followed by 10 washes 
in 6xSSPE-T at 25°C. The probe arrays were scanned at 560 nm using a confocal laser- 
scanning microscope (Hewlett Packard GeneArray Scanner G2500A). The readings from 
the quantitative scanning were analysed by the Affymetrix Gene Expression Analysis 
Software. An antibody amplification step followed using normal goat IgG as blocking 
reagent, final concentration 0.1 mg/ml (Sigma) and biotinylated anti-streptavidin antibody 
(goat), final concentration 3 ng/ml (Vector Laboratories). This was followed by a staining 
step with a streptavidin-phycoerythrin conjugate, final concentration 2 µg/µl (Molecular 
Probes, Eugene, OR) in 6xSSPE-T for 30 min at 25°C and 10 washes in 6xSSPE-T at 
25°C. The arrays were then subjected to a second scan under conditions similar to those 
described above. 

Samples 

This part contains information about all the samples used for expression profiling. All samples used were 
obtained fresh from surgery and the tumour material for expression profiling was frozen immediately at -80°C 
after removing material for histopathological analysis. As reference we used biopsies from normal 
urothelium from donors with prostatic hyperplasia or incontinence. 



Patient disease course information - class discovery 

We selected tumours from the entire spectrum of bladder carcinoma for expression 
profiling in order to discover the molecular classes of the disease. The tumours analysed 
are listed in Table 1 below together with the available patient disease course information. 



Table 1 (tumour dates in parentheses are given as ddmmyy; multiple entries in a cell are separated by semicolons) 

Group | Patient | Previous tumours | Tumour examined on array | Pattern | Reviewed histology | Subsequent tumours | Carcinoma in situ* 
A | | | Ta gr 2 (date illegible) | Papillary | Ta gr 3 | | No 
A | 968-1 | | Ta gr 2 (date illegible) | Papillary | + | Ta gr 2 (150101) | No 
A | 934-1 | | Ta gr 2 (220798) | Papillary | + | | No 
A | 928-1 | | Ta gr 2 (240698) | Papillary | + | | No 
A | 930-1 | | Ta gr 2 (300698) | Papillary | | | No 
B | 989-1 | | Ta gr 3 (281098) | Papillary | | | No 
B | 1264-1 | | Ta gr 3 (130600) | Papillary | + | Ta gr 2 (231000); Ta gr 2 (220101); Ta gr 2 (300401) | No 
B | 876-5 | Ta gr 2 (230398); Ta gr 2 (271098); Ta gr [illegible]; Ta gr 2 (011199) | Ta gr 3 (170400) | Papillary | + | | No 
B | 669-7 | Ta gr 2 (101296); Ta gr 2 (150897); Ta gr 1 (161297); Ta gr 3 (270498); Ta gr 2 (220299) | Ta gr 3 (230899) | Papillary | Ta gr 2 | Ta gr 2 (120100); Ta gr 2 (250500); Ta gr 2 (250900); Ta gr 2 (050201) | No 
B | 716-2 | Ta gr 2 (070397) | Ta gr 3 (230497) | Papillary | + | Ta gr 2 (040697); Ta gr 1 (170698) | No 
C | 1070-1 | | Ta gr 3 (150399) | Papillary | + | Ta gr 3 (291099) | Subsequent visit 
C | 956-2 | | Ta gr 3 (061299) | Papillary | + | Ta gr 3 (061200) | Sampling visit 
C | 1062-2 | | Ta gr 3 (120799) | Papillary | + | T1 gr 3 (161199) | Sampling visit 
C | 1166-1 | | Ta gr 3 (271099) | Papillary | + | | Sampling visit 
C | 1330-1 | | Ta gr 3 (311000) | Papillary | + | | Sampling visit 
D | 112-10 | Ta gr 2 (070794); Ta gr 3 (011294); T1 gr 3 (150695); Ta gr 3 (121095); T1 gr 3 (040396); Ta gr 2 (200896); Ta gr 2 (111296); Ta gr 2 (230497); Ta gr 2 (030997) | Ta gr 3 (060198) | Papillary | + | Ta gr 3 (110698); T1 gr 3 (191098); Ta gr 3 (240299); T1 gr 3 (050799); T1 gr 3 (081199); T1 gr 3 (180400) | Previous visit 
D | 320-7 | T1 gr 3 (011194); T1 gr 3 (150896); Ta gr 3 (100897) | Ta gr 3 (290997) | Papillary | | Ta gr 3 (290198); Ta gr 3 (290698) | Sampling visit 
D | 747-7 | Ta gr 2 (010597); Ta gr 2 (220597); Ta gr 2 (230997); Ta gr 2 (260198); T1 gr 3 (270498); Ta gr 2 (170898) | Ta gr 3 (161298) | Papillary | + | Ta gr 2 (050599); Ta gr 2 (280999); Ta gr 2 (141299) | Sampling visit 
D | 967-3 | T1 gr 3 (280998); T1 gr 3 (250199) | Ta gr 3 (140699) | Papillary | | T1 gr 3 (080999) | Sampling visit 
E | | | T1 gr 3 (200996) | Papillary | + | | No 
E | 847-1 | | T1 gr 3 (210198) | Papillary | + | | No 
E | 1257-1 | | T1 gr 3 (date illegible) | [illegible] | | | Sampling visit 
E | 919-1 | | | Papillary | + | | No 
E | 680-1 | | T1 gr 3 (date illegible: 300388) | Papillary | | [illegible]; Ta gr 1 (090399); [illegible]; Ta gr 2 (190301) | No 
E | 812-1 | | T1 gr 3 (061098) | Papillary | + | | No 
E | 1269-1 | | T1 gr 3 (230600) | Papillary | | | No 
E | 1083-2 | Ta gr 2 (280499) | T1 gr 3 (120599) | Papillary | - | | No 
E | | | T1 gr 3 (020500) | Papillary | + | T2 gr 3 (211100) | No 
E | 1065-1 | | T1 gr 3 (160399) | Papillary | | | Subsequent visit 
E | 1134-1 | | T1 gr 3 (181099) | Papillary | T2 gr 3 | T1 gr 3 (280200); T1 gr 3 (020500); [illegible] | Sampling visit 
F | 1164-1 | | T2+ gr 4 (101299) | Solid | gr 3 | | No 
F | 1032-1 | | T2+ gr [illegible] (050199) | Mixed | | | Not measured 
F | 1117-1 | | T2+ gr 3 (010999) | Solid | + | | 
F | 1178-1 | | T2+ gr 3 (200100) | Solid | + | | Not measured 
F | 1078-1 | | T2+ gr 3 (120499) | Solid | | | Not measured 
F | 875-1 | | T2+ gr 3 (180398) | Solid | | | No 
F | 1044-1 | | T2+ gr 3 (010299) | Solid | + | T2+ gr 3 (060999) | Not measured 
F | 1133-1 | | T2+ gr 3 (081099) | Solid | + | | Not measured 
F | 1068-1 | | T2+ gr 3 (220399) | Solid | + | | No 
F | 937-1 | | T2+ gr 3 (280798) | Solid | | | Not measured 


Group A: Ta gr2 tumours - no recurrence within 2 years. 
Group B: Ta gr3 tumours - no prior T1 tumour and no carcinoma in situ in random biopsies. 
Group C: Ta gr3 tumours - no prior T1 tumour but carcinoma in situ in random biopsies. 
Group D: Ta gr3 tumours - a prior T1 tumour and carcinoma in situ in random biopsies. 
Group E: T1 gr3 tumours - no prior T2+ tumour. 
Group F: T2+ tumours gr3/4 - only primary tumours. 

* Carcinoma in situ detected in selected site biopsies at previous, sampling or subsequent visits. 

Patient disease course information - recurrence vs. no recurrence 

From the hierarchical cluster analysis of the tumour samples we found that the tumours with a high 
recurrence frequency were separated from the tumours with low recurrence frequency. To study this further 
we profiled two groups of Ta tumours: 15 tumours with low recurrence frequency and 16 tumours with high 
recurrence frequency. To avoid influence from other tumour characteristics we only used tumours that 
showed the same growth pattern and no sign of concomitant carcinoma in situ. 
Furthermore, the tumours were all primary tumours. The tumours used for identifying genes differentially 
expressed in recurrent and non-recurrent tumours are listed in Table 2 below. 



Table 2 Disease course information of all patients involved. 



Group | Patient | Tumour (date) | Pattern | Carcinoma in situ | Time to recurrence 
A | 968-1 | Ta gr2 | Papillary | no | 27 months 
A | 928-1 | Ta gr2 | Papillary | no | 38 months 
A | 934-1 | Ta gr2 (220798) | Papillary | no | - 
A | 709-1 | Ta gr2 (210798) | Papillary | no | - 
A | 930-1 | Ta gr2 (300698) | Papillary | no | - 
A | 524-1 | Ta gr2 (201095) | Papillary | no | - 
A | 455-1 | Ta gr2 (060695) | Papillary | no | - 
A | 370-1 | Ta gr2 (100195) | Papillary | no | - 
A | 810-1 | Ta gr2 (031097) | Papillary | no | - 
A | 1146-1 | Ta gr2 (231199) | Papillary | no | - 
A | 1161-1 | Ta gr2 (101299) | Mixed | no | - 
A | 1006-1 | Ta gr2 (231198) | Papillary | no | - 
A | 942-1 | Ta gr2 | Papillary | no | 24 months 
A | 1060-1 | Ta gr2 | Papillary | no | 36 months 
A | 1255-1 | Ta gr2 | Papillary | no | 24 months 
B | 441-1 | Ta gr2 | Papillary | no | 6 months 
B | 780-1 | Ta gr2 | Papillary | no | 2 months 
B | 815-2 | Ta gr2 | Papillary | no | 6 months 
B | 829-1 | Ta gr2 | Papillary | no | 4 months 
B | 861-1 | Ta gr2 | Papillary | no | 4 months 
B | 925-1 | Ta gr2 | Papillary | no | 5 months 
B | 1008-1 | Ta gr2 | Papillary | no | 5 months 
B | 1086-1 | Ta gr2 | Papillary | no | 6 months 
B | 1105-1 | Ta gr2 | Papillary | no | 8 months 
B | 1145-1 | Ta gr2 | Papillary | no | 4 months 
B | 1327-1 | Ta gr2 | Papillary | no | 5 months 
B | 1352-1 | Ta gr2 | Papillary | no | 6 months 
B | 1379-1 | Ta gr2 | Papillary | no | 5 months 
B | 533-1 | Ta gr2 | Papillary | no | 4 months 
B | 679-1 | Ta gr2 | Papillary | no | 4 months 
B | 692-1 | Ta gr2 | Papillary | no | 5 months 


Group A: Primary tumours from patients with no recurrence of the disease for 2 years. 
Group B: Primary tumours from patients with recurrence of the disease within 8 months. 



Hierarchical cluster analysis results 

Here we show expanded views of clusters a-j as identified in the 1767 gene-cluster. The tumour cluster 
dendrogram and colour bars on top of the clusters represent the same tumour cluster as shown in the 
paper. The four samples to the left are normal biopsies (normal 1-3) and a pool of 37 normal biopsies 
(normal 4). 

Classification of samples 

From the hierarchical cluster analysis of the samples (class discovery) we identified three major "molecular 
classes" of bladder carcinoma highly associated with the pathologic staging of the samples. Based on this 
finding we decided to build a molecular classifier that assigns tumours to these three "molecular classes". To 
build the classifier, we only used the tumours in which there was a correlation between the "molecular class" 
and the associated pathologic stage. Consequently, a T1 tumour clustering in the "molecular class" of T2 
tumours was not used to build the classifier. 

The genes used in the classifier were those genes with the highest values of the ratio (B/W) of the variation 
between the groups to the variation within the groups. High values of the ratio (B/W) signify genes with good 
group separation performance. We calculated the sum over the genes of the squared distance from the 
sample value to the group mean and classified the sample as belonging to the group where the distance to 
the group mean was smallest. If the relative difference between the distance to the closest and the second 
closest group, compared to the distance to the closest group, was below 5%, the classification failed and the 
sample was classified as belonging to both groups. The relative difference is referred to as the classifier 
strength. 
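
The selection and classification rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that "variation" means a sum of squares, uses an unweighted squared distance, and the function names (bw_ratio, select_genes, group_means, classify) are hypothetical.

```python
from statistics import mean


def bw_ratio(values, labels):
    """Between-group over within-group sum of squares for one gene (assumed definition)."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    grand = mean(values)
    between = sum(len(vs) * (mean(vs) - grand) ** 2 for vs in groups.values())
    within = sum((v - mean(vs)) ** 2 for vs in groups.values() for v in vs)
    return between / within if within > 0 else float("inf")


def select_genes(expr, labels, n_genes):
    """expr maps gene -> list of per-sample values; keep the n_genes with highest B/W."""
    ranked = sorted(expr, key=lambda g: bw_ratio(expr[g], labels), reverse=True)
    return ranked[:n_genes]


def group_means(expr, labels, genes):
    """Mean expression of each selected gene within each group."""
    means = {}
    for grp in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == grp]
        means[grp] = {g: mean(expr[g][i] for i in idx) for g in genes}
    return means


def classify(sample, means, threshold=0.05):
    """Assign a sample (gene -> value) to the closest group mean.

    Returns (groups, strength); two groups are returned when the relative
    difference between the two smallest distances is below the threshold.
    """
    dist = {grp: sum((sample[g] - m[g]) ** 2 for g in m) for grp, m in means.items()}
    ordered = sorted(dist, key=dist.get)
    closest, second = ordered[0], ordered[1]
    d1, d2 = dist[closest], dist[second]
    strength = (d2 - d1) / d1 if d1 > 0 else float("inf")
    if strength < threshold:
        return [closest, second], strength
    return [closest], strength
```

A sample lying midway between two group means receives a strength close to zero and is therefore reported as belonging to both groups, which mirrors the 5% rule in the text.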



The classifier performance was tested using from 1 to 160 genes in cross-validation loops. 
Figure 1 shows that the closest correlation to histopathology is obtained in the 
cross-validation models using from 69 to 97 genes. Based on this we chose the model using 80 
genes for cross-validation as our final classifier model. 

Figure 1 Number of classification errors vs. number of genes used in cross-validation 
loops. 

Classifier model using 71 genes 

We selected those genes for our final classifier model that were used in at least 75% (25 times) of the 
cross-validation loops. These 71 genes are listed in Table 3. 
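
As a sketch of how such a usage count could be obtained, the following Python fragment repeats gene selection inside each leave-one-out loop and keeps the genes chosen in at least 75% of the loops. It assumes leave-one-out cross-validation and takes the gene scoring function as a parameter; the names loo_gene_usage and stable_genes are hypothetical.

```python
from collections import Counter


def loo_gene_usage(expr, labels, n_genes, score):
    """Count how often each gene is selected across leave-one-out loops.

    expr maps gene -> list of per-sample values; score(values, labels)
    returns a number where higher means better group separation.
    """
    n = len(labels)
    usage = Counter()
    for left_out in range(n):
        keep = [i for i in range(n) if i != left_out]
        sub_labels = [labels[i] for i in keep]
        scores = {g: score([v[i] for i in keep], sub_labels) for g, v in expr.items()}
        top = sorted(scores, key=scores.get, reverse=True)[:n_genes]
        usage.update(top)
    return usage


def stable_genes(usage, n_loops, min_fraction=0.75):
    """Genes selected in at least min_fraction of the loops."""
    return sorted(g for g, count in usage.items() if count >= min_fraction * n_loops)
```

With 33 loops, as suggested by the usage counts in Table 3, the 75% cut-off corresponds to 25 or more selections.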



Table 3 Feature: accession number on the HuGeneFL array. Number: number of times used in the 80-gene 
cross-validation loops. Test (B/W): see below. 



Feature | Unigene | Description | Number | Test (B/W) 
AF000231_at | | RAB11A, member RAS oncogene family | 33 | 26.77 
D13666_s_at | Hs.136348 | | 33 | 27.71 
D49372_s_at | Hs.54460 | small inducible cytokine subfamily A (Cys-Cys), member 11 | 31 | 25.78 
D83920_at | Hs.252136 | ficolin (collagen/fibrinogen domain-containing) 1 | 33 | 31.18 
D86479_at | Hs.118397 | AE-binding protein 1 | 33 | 28.29 
D89077_at | Hs.75367 | Src-like-adaptor | 33 | 30.03 
D89377_at | Hs.89404 | msh (Drosophila) homeo box homolog 2 | 33 | 51.50 
HG4069-HT4339_s_at | | Monocyte Chemotactic Protein 1 | 27 | 25.08 
HG67-HT67_f_at | | Zinc Finger Protein | 33 | 27.81 
HG907-HT907_at | | Mg44 | 33 | 25.76 

Test for significance 

To test the class separation performance of the 71 selected genes we compared their B/W 
ratios with the corresponding ratios of all the genes calculated from permutations of the arrays. For 
each permutation we constructed three pseudogroups, pseudo-Ta, pseudo-T1 and pseudo-T2, 
so that the proportion of samples from the three original groups was approximately the 
same in the three pseudogroups. We then calculated the ratio of the variation between the 
pseudogroups to the variation within the pseudogroups for all the genes. In 500 
permutations, only twice did a single gene have a B/W value higher than the 
lowest of the original B/W values of the 71 selected genes (the two values being 
25.28 and 25.93). 
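
A permutation check of this kind might look as follows. This is a hedged sketch: the way the pseudogroups are balanced (shuffling within each original group and dealing samples out in turn) is an assumption, as is the sum-of-squares definition of B/W, and the function names are hypothetical.

```python
import random
from statistics import mean


def bw_ratio(values, labels):
    """Between-group over within-group sum of squares for one gene (assumed definition)."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    grand = mean(values)
    between = sum(len(vs) * (mean(vs) - grand) ** 2 for vs in groups.values())
    within = sum((v - mean(vs)) ** 2 for vs in groups.values() for v in vs)
    return between / within if within > 0 else float("inf")


def stratified_pseudolabels(labels, rng):
    """Deal the shuffled members of each original group evenly over the pseudogroups."""
    groups = sorted(set(labels))
    pseudo = [None] * len(labels)
    offset = 0
    for grp in groups:
        idx = [i for i, lab in enumerate(labels) if lab == grp]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            pseudo[i] = "pseudo-" + groups[(k + offset) % len(groups)]
        offset += len(idx)
    return pseudo


def count_exceeding_permutations(expr, labels, threshold, n_perm=500, seed=0):
    """Number of permutations in which at least one gene has B/W above the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        pseudo = stratified_pseudolabels(labels, rng)
        if any(bw_ratio(values, pseudo) > threshold for values in expr.values()):
            hits += 1
    return hits
```

Here the threshold would be the lowest B/W value among the selected genes, and expr would hold all genes on the array.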

Supervised learning prediction of recurrence 

In this part of the work we identified genes differentially expressed between non-recurring 
and recurring tumours. Cross-validation and prediction were performed as previously 
described, except that genes were selected based on the value of the Wilcoxon statistic for 
difference between the two groups. 
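
Gene selection by the Wilcoxon statistic can be illustrated with the short Python sketch below. It computes the rank-sum statistic W with midranks for ties and ranks genes by how far W lies from its expected value under no difference; selecting in both directions is an assumption, and the function names are hypothetical.

```python
def rank_sum(values, labels, group):
    """Wilcoxon rank-sum statistic W for `group`, using midranks for tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        midrank = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = midrank
        pos = end + 1
    return sum(r for r, lab in zip(ranks, labels) if lab == group)


def select_by_wilcoxon(expr, labels, group, n_genes):
    """Keep the n_genes whose W deviates most from its expectation under no difference."""
    n1 = sum(1 for lab in labels if lab == group)
    expected = n1 * (len(labels) + 1) / 2
    extremeness = {g: abs(rank_sum(v, labels, group) - expected) for g, v in expr.items()}
    return sorted(extremeness, key=extremeness.get, reverse=True)[:n_genes]
```

Because the statistic depends only on ranks, a single outlying array has limited influence on which genes are selected.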



Prediction performance 

The prediction performance was tested using from 1 to 200 genes in the cross-validation loops. Figure 3 below 
shows that the lowest error rate (8 errors) is obtained, for example, in the cross-validation model using 39 genes. 
Based on this we selected this cross-validation model as our final predictor. The results of the predictions 
from the 39 gene cross-validation loops are listed in Table 6. The predictor misclassified four of the samples 
in each group, and in one of the predictions the difference in the distances to the two group means was 
below the 5% difference limit as described above. 

The probability of misclassifying 8 or fewer arrays by a random classification is 0.0053. 
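
The quoted probability can be checked with a binomial tail sum. The sketch below assumes that a random classification assigns each of the 31 arrays to the wrong group independently with probability 0.5; this assumption is not stated in the text, but it gives a value that rounds to 0.0053. The function name is hypothetical.

```python
from math import comb


def prob_at_most_errors(n_samples, max_errors, p_error=0.5):
    """Binomial probability of at most max_errors misclassifications among n_samples."""
    return sum(
        comb(n_samples, k) * p_error ** k * (1 - p_error) ** (n_samples - k)
        for k in range(max_errors + 1)
    )


# 31 arrays, at most 8 errors, coin-flip classifier
p_value = prob_at_most_errors(31, 8)
```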

Figure 3 Number of prediction errors vs. number of genes used in cross-validation loops. 

Table 6 Recurrence prediction results of 39 gene cross-validation loops. Group A: Primary 
tumours from patients with no recurrence of the disease for 2 years. Group B: Primary 
tumours from patients with recurrence of the disease within 8 months. Prediction, 0=no 
recurrence, 1=recurrence. Prediction strength: see p.8. 

 | | | 1 | | 0.31 
B | 679-1 | Ta gr2 | 1 | | 0.82 
B | 692-1 | Ta gr2 | 1 | | 0.45 

Genes for classifier 

320 genes 
Chip accession numbers 
AB000220_at 
AF000231_at 
D10922_s_at 
D10925_at 
D11086_at 
D11151_at 
D13435_at 
D13666_s_at 

D14520_at 

D21878_at 
D26129_at 
D26443_at 
D42046_at 
D42047_at 
D45370_at 

D49372_s_at 

D49387_at 

D50495_at 

063135_at 

D64053_at 
D83920 at 
D85131_s_at 
D86062_s_at 
D86479_at 
D86957_at 
D86959_at 
D86974_at 
D86976_at 
D87120_at 
D87433~at 
D87443_at 

D87682_at 
D89077_at 
D89377_at 
D90279_s_at 
HG1996- 



160 genes 



AF000231 at 
D13666_s_at 
D21878_at 
D45370_at 
D49372_s_at 
D83920_at 
D85131_s at 
D86062_s_at 

D86479_at 

D86957_at 
D86976_at 
D87433_at 
D89077_at 
D89377_at 
HG3044- 
HT3742_s_at 
HG371- 
HT26388_s_at 
HG4069- 
HT4339_s_at 
HG67- 
HT67_f_at 
HG907- 
HT907_at 
J02871_sjat 
J03040_at 
J03068_at 
J03241_s_at 
J03278_at 
J03909_at 
J04058_at 
J04130_s_at 
J04162_at 
J04456_at 
J05032_at 
J05070_at 

J05448_at 
K01396_at 
K03430_at 
L13698_at 
L13720 at 



80 genes 40 genes 



20 



AF000231_at 
D13666_s_at 
D49372_s_at 
D83920_at 
D86479_at 
D87433_at 
D89077_at 
D89377_at 

HG4069- 
HT4339_s_at 
HG67-HT67 f_at 
HG907-HT907_at 
J02871_s_at 
J03278_at 
J04058_at 
J05032 at 



D83920 
D89377 
J02871_s 
J05032 
J05070 
M16591_s 
M23178 s 
M320T1. 

M33195 



at D89377_at 

at J05032_at 

at M23178_s_at 

at M32011_at 

at M69203_s_at 

at S77393_at 

at U07231_at 
.at U41315_ma1_ 

8 Qt 

at U47414""at 



M57731_s_at 
M68840 at 

M69203_s_at 
S77393_at 
U01833_at 
U07231 at 



J05070_at U09937_ma1_ 

s dt 

J05448_at U20158lat 

K01396_at U41315_ma1_ 

s st 

L13720 at U47414~at 



L40904 
M12125 
M15395 

M16591_S 
M20530 

M23178_s 
M32011 
M33195 

M55998_s 

M57731 s 
M63262" 
M68840 

M69203_s 
M72885__rna1_s~ 
M83822 
S77393" 
U01833 



at U49352_at 

at U50708_at 

at U65093_at 

at U68385_at 

at U77970_at 

at U90549_at 

at X13334_at 

at X15880_at 

at X15882_at 

at X51408 at 

at X53800_s at 
at X54489_ma1_ 
at 

at X57579 s_at 

at X64072_s_at 

at X67491_f_at 

"at X68194_at 

at X73882 at 



U49352_at 
U50708_at 
U77970_at 
X13334_at 
X57579_s_at 
X64072_s_at 

X68194_at 

X73882_at 

X7852CLat 

Z48605_at 

Z74615 at 



10 genes 



D89377_at 
S77393_at 
U41315_rna1_s_at 
U47414_at 
U77970_at 
X68194_at 
X73882_at 
X78520_at 
Z48605_at 
Z74615_at 



58 



HT2044 at 

HG2090- L13923_at 
HT2l52_s_at 

HG2379- L15409_at 
HT3996_s_at 

HG2463- L17325_at 
HT2559_at 

HG2724- L19872_at 
HT2820_at 

HG3044- L27476_at 
HT3742_s at 

HG3187- L33799_at 
HT3366 s at 

HG3342- L40388_at 
HT3519_s_at 

HG371- L40904_at 
HT26388_s_at 

HG4069- L41919_ma1_a 



HT4339_s_at 
HG67-HT67_f_at 
HG907-HT907_at 
J02871_3_at 
J03040_at 
J03060_at 
J03068_at 
J03241_s_at 
J03278_at 
J03909 at 



t 



M11433_at 
M11718_at 
M12125_at 
M14218_at 
M15395_at 
M16591_s_at 
M17219_at 
M20530_at 
M23178_s_at 
J03925_at M28130_rna1_s 

at 



J04056_at 
J04058_at 

J04130_s_at 
J04152_rna1_s_at 
J04162_at 
J04456 at 
J05032lat 
J05070_at 
J0S448_at 
K01396_at 
K03430_at 

L06797 s at 



M29550_at 
M31165_at 
M32011_at 
M33195_at 
M37033_at 
M37766_at 
M55998_s_at 
M57731_s_at 
M62840 at 
M63262_at 
M68840 at 



M69203_s_at 
L07956_at M72885_rna1_s 

at 



L10343_at 
L11672_r_at 
L13391 at 
L13698_at 
L13720_at 
L13923_at 
L15409 at 



M77349_at 
M82882_at 
M83822_at 
M92934_at 
M95178_at 
S69115_at 
S77393 at 



U07231_at 

U09937_ma1_s_at 

U10550_at 

U20158_at 

U28488_s_at 

U29680__at 

U41315_rna1_s_at 

U47414_at 

U49352_at 

U50708_at 
U52101_at 
U59914__at 
U64520__at 
U65093_at 
U68019_at 
U68385_at 
U74324_at 
U77970_at 
U90549_at 

X04085_rna1_at 

X07438_s_at 
X07743_at 
X13334_at 
X14046_at 
X15880_at 
X15882_at 
X51408_at 

X53800_s_at 
X54489_rna1_at 

X57579_s_at 
X62048 at 

X64072_s_at 

X67491_f_at 
X68194_at 
X73882_at 
X78520_at 
X97267_rna1 s_at 
Y00787ls_at 
Z12173 at 



X78520_at 
Z29331_at 
Z48605_at 
Z74615 at 



L17325_at 
L19872 at 
L20971~at 
L22548>t 
L25444 at 



U09937 ma1_s 
at 



S78187_at 
U01833 at 
U0723Cat 
U09278 at 



Z19554_s_at 
Z26491_s at 
Z29331_at 
Z48605 at 
Z74615 at 



L27476_at U10550_at 

L29008_at U12424_s_at 

L33799_at U16306_at 

L40388_at U20158 at 

L40904_at U20536_slat 

L41559_at U24266_at 

L41919_rna1 at U28249_at 

L42450_at U28488_s_at 

L42621_at U29680 at 

L43821_at U37143~at 

M1 1 433_at U38864_at 

M1 1 718_at U39840_at 

M11749_at U41315_ma1_s 



M 121 25 at U44111_at 

M14058 at U47414_at 

M14218_at U49352_at 

M15395_at U50708_at 

M16591_s_at U52101_at 

M16937_at U59914_at 

M17219 at U60205_at 

M19309_s_at U61981 at 

M19720Lma1_at U64520~at 

M20530_at U65093_at 

M23178_s_at U66619_at 

M24283 at U68019_at 

M24902 at U68385_at 

M27394_s_at U68485_at 

M27436_s_at U74324_at 

M28130_rna1_s_a U77970_at 
t 

M28211 at U83303 cds2 



M29550 at U88871__at 

M29971_at U90549_at 

M31165 at U90716 at 

M32011 at V00594lat 

M33195 at V00594_s_at 

M33374 at X02761 s_at 

M34309 at X040?1_at 

M37033 at X04085 rna1_a 



M37766_at X07438_s_at 
M55067 at X07743_at 
M55153_at X13334_at 



at 



X14046_at 
X14813 at 
X15880_at 
X15882_at 
X51408_at 
X53800_s_at 



M55998_s_at 
M57731_s_at 

M59465_at 

M60278_at 

M62505_at 

M62840_at 

M63256_at X54489_ma1_a 

t 

at X57351_s_at 

at X57579_s_at 

at X58072_at 

at X62048_at 

at X64072 s at 

a X65614_at 
t 

at X66945_at 

at X67491 f at 

at X68194_at 

at X73882 at 

at X7852(Tat 

at X78549_at 

at X78565_at 

at X78669_at 

.at X83618_at 

at X84908_at 

at X90908_at 

at X91504 at 

.at X95632_slat 

_at X97267_rna1_s 
_at 

_at Y00705_at 

.at Y00787 s_at 

at Y00815_at 

_at Y08374_rna1_a 
t 

at Z12173_at 

at Z19554 s at 

_at Z2649r S _at 

at Z29331_at 

_at Z35491_at 

_at Z48199_at 

_at Z48605 at 

_at Z74615_at 
.at 
at 
at 

a 
t 

at 
at 



M63262 
M64925 
M68840 
M69066 
M69203_sl 
M72885_ma1_s. 

M74719 
M77349 
M81118_ 
M82882 

M83652_s" 
M83822 
M92934" 
M93426 
M95178 
M95787 
M98528 
M98539 

S49692_s 
S59049. 

S62539 
S69115" 
S77393 
S78187. 

S83325_s 
U01691_s 
U01833 
U03851 
U05227" 
U05861 
U06681 
U07231 
U08021" 
U09278 
U09578* 
U09770 
U09937_ma1_s 

U10099_s 
U10550" 



Ul2424_s__at 
U12535 at 
U12778_at 
U16306_at 

U19713 s_at 
U20158_at 

U20536_s at 
U24266>t 
U24577_at 
U28249 at 
U28368_at 

U28488_s_at 
U29680_at 
U29953_rna1_at 
U30313 at 
U33818_at 
U36735_at 
U36341_ma1_at 
U37143_at 
U37431_at 
U38175_at 
U38864 at 
U3984(fat 
U40490_at 
U40705_at 
U41315_ma1_s_a 
t 

U41745_at 
U42360 cds2_at 
U44111_at 

U45878_s_at 
U46461_at 
U47414_at 
U49362_at 
U50534_at 
U50708 at 

U51010_s_at 
U51711_at 
U52101_at 
U52960_at 
U53003_at 
U53225 at 

U58046_slat 
U59913_at 
U59914_at 
U60205_at 
U61981 at 
U62389_at 
U63289_at 
U63824_at 
U64520 at 



U65093 at 
U66619_at 
U68019_at 
U68385 at 
U68485 at 
U70063""at 
U73514 at 
U74324_at 
U77970_at 

U78027_ma4_at 
U79271_at 
U79751_at 
U80456_at 

U83303_cds2_at 
U88871_at 
U89942__at 
U90549_at 
U90716 at 
U91985 at 
V00594~at 
V00594_s at 

X00371_rnaCat 
X02761 s_at 
X03663_at 
X04011_at 

X04085_rna1 at 
X04500_at 
X04602_s_at 
X04741_at 
X0S256_at 
X07203_at 
X07438_s_at 
X07743_at 
X12530_s_at 
X13334 at 
X14046>t 
X14813_at 

X15306_rna1_at 
X15573_at 
X15880 at 
X15882_at 
X17042_at 
X17644 s_at 
X51408_at 
X51757_at 
X51823_at 
X52022_at 
X53331_at 
X53800 s at 

X54489_ma1_at 
X56687 s at 



X57351_s at 

X57579_s>t 
X58072_at 
X59770 at 
X62048_at 
X62466_at 
X62535_at 
X64044_at 

X64072 s_at 
X65614 at 
X66945lat 

X67491_f_at 
X68194 at 
X73882lat 
X75042 at 
X7852Cfat 
X78549_at 
X78565_at 
X78669_at 
X82209 at 
X83572_at 
X83618_at 
X84908_at 
X86098_at 

X89109_s at 
X90858_at 
X90908_at 
X91504_at 
X93036_at 
X95097_ma1_s_a 
t 

X95592 at 
X95632_s_at 
X95677_at 
X97267_rna1_s_a 
t 

Y00705_at 
Y00787 s_at 
Y008?5_at 
Y07867_at 
Y08374 rna1_at 
Y12556 at 
Z12173lat 
Z19554 s_at 
Z26491~s_at 
Z29331_at 
Z35278_at 
Z35491_at 
Z48199_at 
Z48579_at 
Z48605lat 



26 gene recurrence predictor 

We selected the genes used in at least 29 of the 31 cross-validation loops to constitute our final recurrence 
prediction model. These 26 genes are listed in Table 7. 



Table 7 The 26 genes that we find optimal for recurrence prediction. 



Feature | Unigene | Description | Number | Test (W-N) 
AF006041_at | Hs.336916 | death-associated protein 6 | 31 | 0.054 (161-7) 
D21337_at | Hs.408 | collagen, type IV, alpha 6 | [illegible] | 0.058 (160-6) 
D49387_at | | NADP dependent leukotriene b4 12-hydroxydehydrogenase | 31 | 0.118 (313-8) 
D64154_at | Hs.90107 | cell membrane glycoprotein, 110000M(r) (surface antigen) | 31 | 0.078 (165-9) 
D83780_at | Hs.8294 | KIAA0196 gene product | 31 | 0.094 (159-4) 
D87258_at | Hs.75111 | protease, serine, 11 (IGF binding) | 30 | 0.112 (168-11) 
D87437_at | Hs.15087 | chromosome 1 open reading frame 16 | 31 | 0.058 (160-6) 
HG1879-HT1919_at | | Ras-Like Protein Tc10 | 31 | 0.122 (314-7) 
HG3076-HT3238_s_at | | Heterogeneous Nuclear Ribonucleoprotein K, Alt. Splice 1 | 31 | 0.080 (309-17) 
HG511-HT511_at | | Ras Inhibitor Inf | 31 | 
L34155_at | Hs.83450 | laminin, alpha 3 | [illegible] | 0.122 (314-7) 
L38928_at | Hs.118131 | 5,10-methenyltetrahydrofolate synthetase (5-formyltetrahydrofolate cyclo-ligase) | 29 | 0.348 (319-2) 
L49169_at | Hs.75678 | FBJ murine osteosarcoma viral oncogene homolog B | 31 | 0.108 (155-2) 
M16938_s_at | Hs.820 | homeo box C6 | 29 | 0.09 (170-16) 
M63175_at | Hs.80731 | autocrine motility factor receptor | 29 | 0.098 (308-18) 
M64572_at | Hs.153932 | protein tyrosine phosphatase, non-receptor type 3 | 31 | 0.064 (305-31) 
M98528_at | Hs.79404 | neuron-specific protein | 31 | 0.122 (314-7) 
U21858_at | Hs.60879 | TAF9 RNA polymerase II, TATA box binding protein (TBP)-associated factor, 32 kD | 31 | 0.122 (314-7) 
U45973_at | Hs.178347 | SKIP for skeletal muscle and kidney enriched inositol phosphatase | 31 | 0.094 (310-14) 
U58516_at | Hs.3745 | milk fat globule-EGF factor 8 protein | 29 | 0.100 (175-28) 
U62015_at | Hs.8867 | cysteine-rich, angiogenic inducer, 61 | 31 | 0.106 (169-13) 
U66702_at | Hs.74624 | protein tyrosine phosphatase, receptor type, N polypeptide 2 | 31 | 0.146 (149-1) 
U70439_s_at | Hs.84264 | acidic protein rich in leucines | 30 | 0.08 (309-17) 
U94855_at | Hs.7811 | eukaryotic translation initiation factor 3, subunit 5 | 30 | 0.092 (311-12) 
X63469_at | Hs.77100 | general transcription factor IIE, polypeptide 2 | 31 | 0.092 (311-12) 
Z23064_at | Hs.146381 | RNA binding motif protein, X chromosome | 30 | 0.066 (307-24) 



Number: number of times the gene has been used in a cross-validation loop. Test: the 
numbers in parentheses are the value W of the Wilcoxon test statistic for no difference 
between the two groups together with the number N of genes for which the Wilcoxon test 
statistic is bigger than or equal to the value W. The test value is obtained from 500 
permutations of the arrays. In each permutation we form new pseudogroups where both of 
the pseudogroups have the same proportion of arrays from the two original groups. For 
each permutation we count the number of genes for which the Wilcoxon test statistic 
based on the pseudogroups is bigger than or equal to W, and the test value is the 
proportion of the permutations for which this number is bigger than or equal to N. Thus the 
test value measures the significance of the observed value W. Consequently, for most of 
our selected genes we only find at least as good predictive genes in about 10% of the 
formed pseudogroups. 
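
The test value defined above can be sketched as follows. The fragment takes the comparison "bigger than or equal to W" literally and balances the pseudogroups by shuffling within each original group and alternating the assignment; both choices are assumptions, and the function names are hypothetical.

```python
import random


def rank_sum(values, labels, group):
    """Wilcoxon rank-sum statistic W for `group`, using midranks for tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        midrank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = midrank
        pos = end + 1
    return sum(r for r, lab in zip(ranks, labels) if lab == group)


def balanced_pseudolabels(labels, rng):
    """Split each original group as evenly as possible between the two pseudogroups."""
    groups = sorted(set(labels))
    pseudo = [None] * len(labels)
    offset = 0
    for grp in groups:
        idx = [i for i, lab in enumerate(labels) if lab == grp]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            pseudo[i] = groups[(k + offset) % len(groups)]
        offset += len(idx)
    return pseudo


def permutation_test_value(expr, labels, group, w_obs, n_obs, n_perm=500, seed=0):
    """Proportion of permutations with at least n_obs genes whose W is >= w_obs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        pseudo = balanced_pseudolabels(labels, rng)
        count = sum(1 for v in expr.values() if rank_sum(v, pseudo, group) >= w_obs)
        if count >= n_obs:
            hits += 1
    return hits / n_perm
```

A small test value indicates that balanced pseudogroups rarely produce as many genes with a statistic as large as the observed W.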

Web Table B: Patient disease course information - recurrence vs. no recurrence 

Group | Patient | Tumour (date) | Pattern | Carcinoma in situ | Time to recurrence 
A | 968-1 | Ta gr2 | Papillary | no | 27 months 
A | 928-1 | Ta gr2 | Papillary | no | 38 months 
A | 934-1 | Ta gr2 (220798) | Papillary | no | - 
A | 709-1 | Ta gr2 (210798) | Papillary | no | - 
A | 930-1 | Ta gr2 (300698) | Papillary | no | - 
A | 524-1 | Ta gr2 (201095) | Papillary | no | - 
A | 455-1 | Ta gr2 (060695) | Papillary | no | - 
A | 370-1 | Ta gr2 (100195) | Papillary | no | - 
A | 810-1 | Ta gr2 (031097) | Papillary | no | - 
A | 1146-1 | Ta gr2 (231199) | Papillary | no | - 
A | 1161-1 | Ta gr2 (101299) | Mixed | no | - 
A | 1006-1 | Ta gr2 (231198) | Papillary | no | - 
A | 942-1 | Ta gr2 | Papillary | no | 24 months 
A | 1060-1 | Ta gr2 | Papillary | no | 36 months 
A | 1255-1 | Ta gr2 | Papillary | no | 24 months 
B | 441-1 | Ta gr2 | Papillary | no | 6 months 
B | 780-1 | Ta gr2 | Papillary | no | 2 months 
B | 815-2 | Ta gr2 | Papillary | no | 6 months 
B | 829-1 | Ta gr2 | Papillary | no | 4 months 
B | 861-1 | Ta gr2 | Papillary | no | 4 months 
B | 925-1 | Ta gr2 | Papillary | no | 5 months 
B | 1008-1 | Ta gr2 | Papillary | no | 5 months 
B | 1086-1 | Ta gr2 | Papillary | no | 6 months 
B | 1105-1 | Ta gr2 | Papillary | no | 8 months 
B | 1145-1 | Ta gr2 | Papillary | no | 4 months 
B | 1327-1 | Ta gr2 | Papillary | no | 5 months 
B | 1352-1 | Ta gr2 | Papillary | no | 6 months 
B | 1379-1 | Ta gr2 | Papillary | no | 5 months 
B | 533-1 | Ta gr2 | Papillary | no | 4 months 
B | 679-1 | Ta gr2 | Papillary | no | 4 months 
B | 692-1 | Ta gr2 | Papillary | no | 5 months 


Group A: Primary tumours from patients with no recurrence of the disease for 2 years. 
Group B: Primary tumours from patients with recurrence of the disease within 8 months. 



Web Figure C: Number of classification errors vs. number of genes used in cross-validation 
loops. 

Classification performance 



Web Figure F: Number of classification errors vs. number of genes used in cross 
validation loops. 



Cross-validation performance 
Web Table A: Patient disease course information - class discovery

Group | Patient | Previous tumours | Tumour examined on array | Pattern | Reviewed histology | Subsequent tumours | Carcinoma in situ*
A | 709-1 |  | Ta gr 2 (200297) | Papillary | Ta gr3 |  | no
A | 968-1 |  | Ta gr 2 (011098) | Papillary | [illegible] | Ta gr 2 (150101) | no
A | 934-1 |  | Ta gr 2 (220798) | Papillary | [illegible] |  | no
A | 928-1 |  | Ta gr 2 (240698) | Papillary | [illegible] |  | no
A | 930-1 |  | Ta gr 2 (300698) | Papillary | [illegible] |  | no
B | 989-1 |  | Ta gr 3 (281098) | Papillary | [illegible] |  | no
B | 1264-1 |  | Ta gr 3 (130600) | Papillary | [illegible] | Ta gr 2 (231000); Ta gr 2 (220101); Ta gr 2 (300401) | no
B | 876-5 | Ta gr 2 (230398); Ta gr 2 (271098); Ta gr 2 (090699); Ta gr 2 (011199) | Ta gr 3 (170400) | Papillary | [illegible] |  | no
B | 869-7 | Ta gr 2 (101296); Ta gr [illegible]; Ta gr 1 (161297); Ta gr 3 (270498); Ta gr 2 (220299) | Ta gr 3 (230899) | Papillary | Ta gr2 | Ta gr 2 (120100); Ta gr 2 (250500); Ta gr 2 (250900); Ta gr 2 (050201) | no
B | 716-2 | Ta gr 2 (070397) | Ta gr 3 (230497) | Papillary |  | Ta gr 2 (040697); Ta gr [illegible] (170698) | no
C | 1070-1 |  | Ta gr 3 (150399) | Papillary | [illegible] | Ta gr 3 (291099) | Subsequent visit
C | 956-2 |  | Ta gr 3 (061299) | Papillary |  | Ta gr 3 (061200) | Sampling visit
C | 1062-2 |  | Ta gr 3 (120799) | Papillary | + | T1 gr 3 (161199) | Sampling visit
C | [illegible] |  | Ta gr [illegible] | Papillary |  |  | Sampling visit
C | 1330-1 |  | Ta gr 3 (311000) | Papillary | + |  | Sampling visit
D | 112-10 | Ta gr 2 (070794); Ta gr 3 (011294); T1 gr 3 (150695); Ta gr 3 (121095); T1 gr 3 (040396); Ta gr 2 (200896); Ta gr 2 (111296); Ta gr 2 (230497); Ta gr 2 (030997) | Ta gr 3 (060198) | Papillary | + | Ta gr 3 (110698); T1 gr 3 (191098); Ta gr 3 (240299); T1 gr 3 (050799); T1 gr 3 (081199); T1 gr 3* (180400) | Previous visit
D | 320-7 | T1 gr 3 (011194); T1 gr 3 (150898); Ta gr 3 (100897) | Ta gr 3 (290997) | Papillary | + | Ta gr 3 (290198); Ta gr 3 (290698) | Sampling visit
D | 747-7 | Ta gr 2 (010597); Ta gr 2 (220597); Ta gr [illegible]; Ta gr 2 (260198); T1 gr 3 (270498); Ta gr 2 (170898) | Ta gr 3 (161298) | Papillary | + | Ta gr 2 (050599); Ta gr 2 (280999); Ta gr 2 (141299) | Sampling visit
D | [illegible] | T1 gr 3 (250199) | Ta gr [illegible] |  | [illegible] | T1 gr 3 (080999) | Sampling visit
E | 625-1 |  | T1 gr 3 (200996) | Papillary | + |  | No
E | 847-1 |  | T1 gr 3 (210198) | Papillary | + |  | No
E | 1257-1 |  | T1 gr 3 (240500) | Solid |  |  | Sampling visit
E | [illegible] |  | T1 gr 3 (220698) | Papillary | + |  | No
E | [illegible] |  | T1 gr 3 ([illegible]) | Papillary | + | Ta gr 2 (091198); Ta gr 1 (090399); Ta gr 2 (050900); Ta gr 2 (190301) | No
E | 812-1 |  | T1 gr 3 (061098) | Papillary |  |  | No
E | 1269-1 |  | T1 gr 3 (230600) | Papillary |  |  | No
E | 1083-2 | Ta gr 2 (280499) | T1 gr 3 (120599) | Papillary |  |  | No
E | 1238-1 |  | T1 gr 3 (020500) | Papillary | [illegible] | T2 gr 3 (211100); Ta gr 2 (211100) | No
E | 1065-1 |  | T1 gr 3 (160399) | Papillary |  |  | Subsequent visit
E | 1134-1 |  | T1 gr 3 (181099) | Papillary | T2 gr3 | T1 gr 3 (280200); T1 gr 3 (020500); T1 gr 3 (131100) | Sampling visit
F | [illegible] |  | [illegible] | [illegible] | [illegible] gr 3 |  | No
F | 1032-1 |  | [illegible] |  |  |  | Not measured
F | 1117-1 |  |  | [illegible] | + |  | Sampling visit
F | 1170-1 |  |  | Solid | + |  | Not measured
F | 1076-1 |  | T2+ gr 3 (120499) | Solid | + |  | Not measured
F | 875-1 |  | T2+ gr 3 (160398) | Solid |  |  | No
F | 1044-1 |  | T2+ gr 3 (010299) | Solid | + | T2+ gr 3 (060999) | Not measured
F | 1133-1 |  | T2+ gr 3 (081099) | Solid | + |  | Not measured
F | 1068-1 |  | T2+ gr 3 (220399) | Solid |  |  | No
F | 937-1 |  | T2+ gr 3 (280798) | Solid |  |  | Not measured

Group A: Ta gr2 tumours - no recurrence within 2 years.
Group B: Ta gr3 tumours - no prior T1 tumour and no carcinoma in situ in random biopsies.
Group C: Ta gr3 tumours - no prior T1 tumour but carcinoma in situ in random biopsies.
Group D: Ta gr3 tumours - a prior T1 tumour and carcinoma in situ in random biopsies.
Group E: T1 gr3 tumours - no prior T2+ tumour.
Group F: T2+ tumours gr3/4 - only primary tumours.

* Carcinoma in situ detected in selected site biopsies at previous, sampling or subsequent visits.
Web Table B: The 32 genes used in at least 75% (27 times) of the cross validation loops.

Feature | Unigene | Description | Number | Test (B/W) | Testgroup
D83920_at | Hs.252136 | ficolin (collagen/fibrinogen domain-containing) 1 | 31 | 33.62 | 3
HG67-HT67_f_at | NA | zinc finger protein SBZF3 | 35 | 51.47 | 1
HG907-HT907_at | Hs.37936 | suppressor of variegation 3-9 (Drosophila) homolog 1 | 35 | 43.63 | 1
J05032_at |  | aspartyl-tRNA synthetase | 35 | 44.30 | 1
K01396_at | Hs.297681 | serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1 | 31 | 34.24 | 3
M16591_s_at | Hs.89555 | hemopoietic cell kinase | 35 | 38.71 | 3
M32011_at | Hs.949 | neutrophil cytosolic factor 2 (65kD, chronic granulomatous disease, autosomal 2) | 35 | 48.35 | 3
M33195_at | Hs.743 | Fc fragment of IgE, high affinity I, receptor for; gamma polypeptide | 29 | 33.12 | 3
M37033_at | Hs.62212 | CD53 antigen | 33 | 34.08 | 3
M57731_s_at | Hs.75765 | GRO2 oncogene | 35 | 37.07 | 3
M63262_at | NA | arachidonate 5-lipoxygenase-activating protein | 35 | 37.52 | 3
S77393_at |  | ESTs | 35 | 85.04 | 2
U01833_at |  | nucleotide binding protein 1 (E. coli MinD like) | 35 | 54.81 | 1
U07231_at | Hs.309763 | G-rich RNA sequence binding factor 1 | 35 | 80.54 | 2
U41315_rna1_s_at |  | ring zinc-finger protein (ZNF127-Xp) | 35 | 89.24 | 2
U47414_at | Hs.79069 | cyclin G2 | 35 | 82.49 | 2
U50708_at | Hs.1265 | branched chain keto acid dehydrogenase E1, beta polypeptide (maple syrup urine disease) | 35 | 48.75 | 1
U52101_at | Hs.9999 | epithelial membrane protein 3 | 34 | 34.39 | 3
U74324_at | Hs.90875 | RAB interacting factor | 35 | 47.87 | 1
U77970_at | NA | neuronal PAS domain protein 2 (NPAS2) | 30 | 72.77 | 2
U90549_at | Hs.236774 | high-mobility group (nonhistone chromosomal) protein 17-like 3 | 35 | 48.41 | 1
X13334_at | Hs.75827 | CD14 antigen | 34 | 35.00 | 3
X54489_rna1_at | NA | melanoma growth stimulatory activity | 34 | 75.37 | 2
X57579_s_at | Hs.727 | inhibin, beta A (activin A, activin AB alpha polypeptide) | 35 | 89.41 | 2
X64072_s_at | Hs.83988 | integrin, beta 2 (antigen CD18 (p95), lymphocyte function-associated antigen 1; macrophage antigen 1 (mac-1) beta subunit) | 35 | 40.08 | 3
X68194_at | Hs.80919 | synaptophysin-like protein | 29 | 72.29 | 2
X73882_at | Hs.146388 | microtubule-associated protein 7 | 35 | 89.29 | 2
X78520_at | Hs.174139 | chloride channel 3 | 35 | 83.36 | 2
X95632_s_at | Hs.343575 | abl-interactor 12 (SH3-containing protein) | 33 | 41.11 | 1
Z29331_at | Hs.28505 | ubiquitin-conjugating enzyme E2H (homologous to yeast UBC8) | 35 | 63.45 | 1
Z48605_at | Hs.5123 | inorganic pyrophosphatase | 29 | 72.12 | 2
Z74615_at | Hs.172928 | collagen, type I, alpha 1 | 35 | 108.84 | 2



Feature: Accession number on HuGeneFL array.
Number: Number of times used in cross validation.

Testgroup: Genes selected for having a high value of B/W when comparing Ta with T1 (1), Ta with T2 (2), and T1 with T2 (3).

Test (B/W): To test the class separation performance of the 32 selected genes we
compared their B/W ratios with the similar ratios of all the genes calculated from
permutations of the arrays. For each permutation we constructed three pseudogroups,
pseudo-Ta, pseudo-T1, and pseudo-T2, so that the proportion of samples from the three
original groups is approximately the same in the three pseudogroups. We then calculated
the three B/W ratios, B(Ta/T1)/W, B(Ta/T2)/W, and B(T1/T2)/W, based on the
pseudogroups and selected the 32 highest values in the same way as for the actual data.
For the highest scoring gene among the 32 selected we found that the 500 values obtained
from the permutations have a mean value of 19.04, with the highest observed being 43.91.
This should be compared to the value 108.84 from the actual data in Table 4. For the
lowest scoring gene we found that the 500 values had a mean value of 9.69, with the
highest being 20.55 (to be compared with 33.12 from the table).
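
The permutation check described in the Test (B/W) note can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact definition of B/W is not given in this section, so the sketch assumes B is the between-group sum of squares for a pair of groups and W is the pooled within-group variance over all three groups. The function names, the round-robin construction of pseudogroups, and the toy expression values are all hypothetical.

```python
import random
from statistics import mean

GROUPS = ("Ta", "T1", "T2")


def bw_ratios(values, labels, groups=GROUPS):
    """Return the pairwise B/W ratios for one gene (assumed definition)."""
    by_group = {g: [v for v, l in zip(values, labels) if l == g] for g in groups}
    means = {g: mean(xs) for g, xs in by_group.items()}
    # W: pooled within-group variance over all groups (assumption).
    within_ss = sum((x - means[g]) ** 2 for g, xs in by_group.items() for x in xs)
    w = within_ss / (len(values) - len(groups))
    ratios = {}
    for i, g in enumerate(groups):
        for h in groups[i + 1:]:
            n_g, n_h = len(by_group[g]), len(by_group[h])
            # B: between-group sum of squares for the pair (assumption).
            b = n_g * n_h / (n_g + n_h) * (means[g] - means[h]) ** 2
            ratios[(g, h)] = b / w
    return ratios


def pseudo_labels(labels, rng, groups=GROUPS):
    """Deal each original group round-robin over the pseudogroups, so every
    pseudogroup holds roughly the same mix of original groups."""
    new = [None] * len(labels)
    for g in groups:
        idx = [i for i, l in enumerate(labels) if l == g]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            new[i] = groups[k % len(groups)]
    return new


def permuted_max_ratios(values, labels, n_perm=500, seed=0):
    """Largest pairwise B/W ratio of one gene in each of n_perm permutations."""
    rng = random.Random(seed)
    return [
        max(bw_ratios(values, pseudo_labels(labels, rng)).values())
        for _ in range(n_perm)
    ]


# Toy expression values for a single gene (hypothetical data).
values = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0,
          2.0, 2.1, 1.9, 2.2, 1.8, 2.0,
          5.0, 5.2, 4.8, 5.1, 4.9, 5.0]
labels = ["Ta"] * 6 + ["T1"] * 6 + ["T2"] * 6

observed = bw_ratios(values, labels)
permuted = permuted_max_ratios(values, labels)
```

Comparing `max(observed.values())` with the mean and maximum of `permuted` mirrors the comparison reported above (108.84 against a permutation mean of 19.04 and maximum of 43.91). The full procedure in the text ranks all genes on the array in each permutation; the sketch handles one gene for brevity.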



Web Table F: Recurrence prediction results of 39 gene cross-validation loops.

Group | Patient | Tumour (date) | Prediction | Error | Prediction strength
A | 968-1 | Ta gr2 | 0 |  | 0.19
A | 928-1 | Ta gr2 | 0 |  | 0.49
A | 934-1 | Ta gr2 (220798) | 0 |  | 1.73
A | 709-1 | Ta gr2 (210798) | 0 |  | 0.45
A | 930-1 | Ta gr2 (300698) | 0 |  | 0.82
A | 524-1 | Ta gr2 (201095) | 0 |  | 0.14
A | 455-1 | Ta gr2 (060695) | 1 | * | 0.68
A | 370-1 | Ta gr2 (100195) | 0 |  | 0.32
A | 810-1 | Ta gr2 (031097) | 0 |  | 0.45
A | 1146-1 | Ta gr2 (231199) | 0 |  | 0.98
A | 1161-1 | Ta gr2 (101299) | 0 |  | 0.03
A | 1006-1 | Ta gr2 (231198) | 1 | * | 1.57
A | 942-1 | Ta gr2 | 0 |  | 0.31
A | 1060-1 | Ta gr2 | 1 | * | 0.81
A | 1255-1 | Ta gr2 | 1 | * | 0.71
B | 441-1 | Ta gr2 | 1 |  | 1.03
B | 780-1 | Ta gr2 | 1 |  | 0.37
B | 815-2 | Ta gr2 | 1 |  | 0.35
B | 829-1 | Ta gr2 | 1 |  | 0.75
B | 861-1 | Ta gr2 | 0 | * | 2.55
B | 925-1 | Ta gr2 | 1 |  | 0.78
B | 1008-1 | Ta gr2 | 0 | * | 0.12
B | 1086-1 | Ta gr2 | 0 | * | 0.51
B | 1105-1 | Ta gr2 | 1 |  | 0.37
B | 1145-1 | Ta gr2 | 1 |  | 0.44
B | 1327-1 | Ta gr2 | 1 |  | 1.96
B | 1352-1 | Ta gr2 | 0 | * | 0.97
B | 1379-1 | Ta gr2 | 1 |  | 0.67
B | 533-1 | Ta gr2 | 1 |  | 0.31
B | 679-1 | Ta gr2 | 1 |  | 0.82
B | 692-1 | Ta gr2 | 1 |  | 0.45



Group A: Primary tumours from patients with no recurrence of the disease for 2 years.
Group B: Primary tumours from patients with recurrence of the disease within 8 months.
Prediction: 0 = no recurrence, 1 = recurrence.

Prediction strength: The relative difference between the distance to the closest and
second closest group compared to the distance to the closest group.
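
Read literally, the prediction strength definition amounts to (d2 - d1) / d1, where d1 is the distance from a sample to its closest group and d2 the distance to the second closest. The short sketch below illustrates that reading only. The distance measure itself is not specified in this section, so the distances are assumed to be precomputed, and the function name and example values are hypothetical.

```python
def prediction_strength(distances):
    """Relative gap between the closest and second closest group.

    `distances` maps each group name to the sample's distance to that group
    (how the distance is measured is assumed to be decided elsewhere).
    """
    if len(distances) < 2:
        raise ValueError("need distances to at least two groups")
    d1, d2 = sorted(distances.values())[:2]
    if d1 == 0:
        raise ValueError("distance to the closest group is zero")
    return (d2 - d1) / d1


# Hypothetical sample: closer to the no-recurrence group.
example = {"no recurrence": 2.0, "recurrence": 3.0}
strength = prediction_strength(example)        # (3.0 - 2.0) / 2.0
predicted = min(example, key=example.get)      # group with the smallest distance
```

Under this reading a value near 0 means the sample lies almost equally far from both groups, while larger values indicate a clearer assignment.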






Web Table G: The 26 genes used in at least 75% (29 times) of the cross validation loops.

Feature | Unigene | Description | Number | Test (W-N)
AF006041_at | Hs.336916 | death-associated protein 6 | 31 | 0.054 (161-7)
D21337_at | Hs.408 | collagen, type IV, alpha 6 | 31 | 0.058 (160-6)
D49387_at |  | NADP dependent leukotriene b4 12-hydroxydehydrogenase | 31 | 0.118 (313-8)
D64154_at | Hs.90107 | cell membrane glycoprotein, 110000M(r) (surface antigen) | 31 | 0.078 (165-9)
D83780_at | Hs.8294 | KIAA0196 gene product | 31 | 0.094 (159-4)
D87258_at | Hs.75111 | protease, serine, 11 (IGF binding) | 30 | 0.112 (168-11)
D87437_at | Hs.15087 | chromosome 1 open reading frame 16 | 31 | 0.058 (160-6)
HG1879-HT1919_at |  | Ras-Like Protein Tc10 | 31 | 0.122 (314-7)
HG3076-HT3238_s_at |  | Heterogeneous Nuclear Ribonucleoprotein K, Alt. Splice 1 | 31 | 0.080 (309-17)
HG511-HT511_at |  | Ras Inhibitor Inf | 31 | 0.348 (319-2)
L34155_at | Hs.83450 | laminin, alpha 3 | 31 | 0.122 (314-7)
L38928_at | Hs.118131 | 5,10-methenyltetrahydrofolate synthetase (5-formyltetrahydrofolate cyclo-ligase) | 29 | 0.348 (319-2)
L49169_at | Hs.75678 | FBJ murine osteosarcoma viral oncogene homolog B | 31 | 0.108 (155-2)
M16938_s_at | Hs.820 | homeo box C6 | 29 | 0.09 (170-16)
M63175_at | Hs.80731 | autocrine motility factor receptor | 29 | 0.098 (308-18)
M64572_at | Hs.153932 | protein tyrosine phosphatase, non-receptor type 3 | 31 | 0.084 (305-31)
M98528_at | Hs.79404 | neuron-specific protein | 31 | 0.122 (314-7)
U21858_at | Hs.60679 | TAF9 RNA polymerase II, TATA box binding protein (TBP)-associated factor, 32 kD | 31 | 0.122 (314-7)
U45973_at | Hs.178347 | SKIP for skeletal muscle and kidney enriched inositol phosphatase | 31 | 0.094 (310-14)
U58516_at |  | milk fat globule-EGF factor 8 protein | 29 | 0.100 (175-28)
U62015_at |  | cysteine-rich, angiogenic inducer, 61 | 31 | 0.108 (169-13)
U66703_at |  | protein tyrosine phosphatase, receptor type, N polypeptide 2 | 31 | 0.146 (149-1)
U70439_s_at |  | acidic protein rich in leucines | 30 | 0.08 (309-17)
U94855_at |  | eukaryotic translation initiation factor 3, subunit 5 | 30 | 0.092 (311-12)
X63469_at |  | general transcription factor IIE, polypeptide 2 | 31 | 0.092 (311-12)
Z23064_at | Hs.146381 | RNA binding motif protein, X chromosome |  | 0.066 (307-24)



Feature: Accession number on HuGeneFL array.

Number: Number of times the gene has been used in a cross-validation loop.
Test: The numbers in parentheses are the value W of the Wilcoxon test statistic for no
difference between the two groups, together with the number N of genes for which the
Wilcoxon test statistic is bigger than or equal to the value W. The test value is obtained
from 500 permutations of the arrays. In each permutation we form new pseudogroups
where both of the pseudogroups have the same proportion of arrays from the two original
groups. For each permutation we count the number of genes for which the Wilcoxon test
statistic based on the pseudogroups is bigger than or equal to W, and the test value is the
proportion of the permutations for which this number is bigger than or equal to N. Thus the
test value measures the significance of the observed value W. Consequently, for most of
our selected genes we find at least as good predictive genes in only about 10% of the
formed pseudogroups.
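
The test value described above can be sketched in a few lines. This is an illustrative reading of the note, not the authors' code: it uses the rank-sum form of the Wilcoxon statistic with midranks for ties, follows the one-sided "bigger than or equal to W" rule exactly as worded, and builds each pseudogroup at the size of the original group with a matching share of arrays from both original groups. All names and the three toy genes are hypothetical.

```python
import random


def rank_sum(values, in_group):
    """Wilcoxon rank-sum statistic: sum of (mid)ranks of the flagged samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1          # average rank for a run of ties
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return sum(r for r, flag in zip(ranks, in_group) if flag)


def balanced_pseudogroup(in_group, rng):
    """Pseudogroup of the original size, drawing from both original groups
    in proportion to their share of all arrays."""
    n_total, n_true = len(in_group), sum(in_group)
    from_true = round(n_true * n_true / n_total)
    from_false = n_true - from_true
    true_idx = [i for i, f in enumerate(in_group) if f]
    false_idx = [i for i, f in enumerate(in_group) if not f]
    chosen = rng.sample(true_idx, from_true) + rng.sample(false_idx, from_false)
    pseudo = [False] * n_total
    for i in chosen:
        pseudo[i] = True
    return pseudo


def permutation_test_value(genes, in_group, w, n_perm=500, seed=0):
    """Return (N, test value) for the observed statistic w."""
    rng = random.Random(seed)
    n_obs = sum(rank_sum(g, in_group) >= w for g in genes)
    hits = 0
    for _ in range(n_perm):
        pseudo = balanced_pseudogroup(in_group, rng)
        if sum(rank_sum(g, pseudo) >= w for g in genes) >= n_obs:
            hits += 1
    return n_obs, hits / n_perm


# Toy data: 8 arrays per group, three genes (hypothetical values).
in_group = [False] * 8 + [True] * 8
genes = [
    [float(v) for v in range(1, 17)],                          # separates the groups
    [16, 15, 14, 13, 12, 1, 2, 3, 11, 10, 9, 4, 5, 6, 7, 8],   # mixed
    [16, 15, 1, 2, 3, 4, 5, 6, 14, 13, 12, 11, 10, 9, 7, 8],   # mixed
]
w_observed = rank_sum(genes[0], in_group)
n_genes, test_value = permutation_test_value(genes, in_group, w_observed)
```

A small test value means that balanced pseudogroups rarely yield as many genes at or above W as the real grouping does, which is the sense in which the note says the value measures the significance of W.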








